Mobile App Review Insights through LDA Topic Modeling
Abstract
In this project, I analyze customer needs through text mining of healthcare app reviews and, based on the results, propose a design strategy for healthcare apps. I collected 34,320 reviews from 10 healthcare apps in the Google Play Store and performed LDA topic modeling to analyze customer needs in depth.
Dataset
1. User reviews
Collected 34,320 reviews from 10 healthcare apps in the Google Play Store
Data Collection Method: Crawling the reviews on the Google Play Store
2. Preprocessing
The dataset used for preprocessing purposes is as follows:
- List for word substitution
Since LDA topic modeling builds its results from the most frequent vocabulary, unifying words with the same meaning into a single word makes semantic analysis of the text more effective. For example, ‘iPhone’ and ‘galaxy s8’ both refer to a ‘smartphone’. A human can tell that these words share a meaning, but the computer treats them as different words. As a result, a frequent concept can be missed as a keyword, because its occurrences are split across several surface forms and each form is counted less often. Therefore, in text mining techniques where word frequency matters, such as LDA topic modeling, replacing words beforehand is one way to increase the effectiveness of the analysis.
- List for stopwords
A stopword is a word that appears frequently in the text but carries little information about the user’s reactions or opinions, typically a function word or a generic expression. Such words have nothing to do with user experience. Therefore, it is necessary to curate these stopwords carefully in the preprocessing stage.
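To make the effect of the two lists concrete, here is a minimal sketch that counts word frequencies before and after normalization. The replacement map, the stopword set, and the three sample reviews are hypothetical values chosen for illustration; the project itself loads its lists from Excel files.

```python
from collections import Counter

# Hypothetical sample data for illustration only; the project loads its
# real lists from replace_list.xlsx and stopword_list.xlsx.
REPLACEMENTS = {"iphone": "phone", "galaxy": "phone", "smartphone": "phone"}
STOPWORDS = {"the", "my", "on", "is", "really", "great"}

def normalize(tokens, replacements, stopwords):
    """Map synonyms to one canonical word, then drop stopwords."""
    unified = [replacements.get(tok, tok) for tok in tokens]
    return [tok for tok in unified if tok not in stopwords]

reviews = [
    "the app crashes on my iphone",
    "really great on my galaxy",
    "sync is slow on my smartphone",
]
tokens = [tok for review in reviews for tok in review.split()]

raw_counts = Counter(tokens)
clean_counts = Counter(normalize(tokens, REPLACEMENTS, STOPWORDS))

print(raw_counts["iphone"], raw_counts["phone"])  # counts before normalization
print(clean_counts.most_common(3))                # counts after normalization
```

Before normalization the three device names are counted separately, so none of them stands out. After normalization they are counted as one word, which is the effect the substitution list is meant to achieve.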
LDA topic modeling concepts
Topic modeling is a text mining methodology that finds key topics in text-based document data. Latent Dirichlet Allocation (LDA) is the most representative topic modeling algorithm. Specifically, LDA is a probabilistic modeling technique that analyzes a large collection of documents and estimates which topics each document contains and in what proportion (Blei et al., 2003). Because it also reports which keywords make up each topic, it is well suited to deriving insights from keyword combinations. Recently, LDA topic modeling has been actively applied in various fields, for example to automatically classify similar topics on SNS or to derive customer needs from airline online reviews (Lu et al., 2013; Kwon et al., 2021).
LDA topic modeling visualization
In this project, since review ratings are on a scale of 1 to 5, I classified reviews rated 4-5 as positive and reviews rated 1-2 as negative. LDA topic modeling is performed and visualized separately for the positive and negative review groups, as shown in Figures 1 and 2 below.
Code
1. Google drive mount
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
2. Package install and import
!pip install pyLDAvis==2.1.2
import numpy as np
import pandas as pd
import warnings # ignore warning msg
warnings.filterwarnings(action='ignore')
# Use NLTK
import nltk
import pickle
import re # Regular expression package for string
nltk.download('all')
from tqdm import tqdm # work process visualization
from gensim import corpora # word frequency counting package
import gensim #LDA
import pyLDAvis
import pyLDAvis.gensim
from collections import Counter
3. Load dataset
dataset_raw = pd.read_excel('/content/gdrive/MyDrive/NLP-final-project/dataset_raw.xlsx')
dataset_raw.head()
 | app | review | rating |
---|---|---|---|
0 | FitCoach: Fitness Coach & Diet | Its a nice app but so many features don't actu... | 3 |
1 | FitCoach: Fitness Coach & Diet | Deceptive. Not at all as advertised. Annoying ... | 1 |
2 | FitCoach: Fitness Coach & Diet | Not the app shown in the Facebook ads. I signe... | 1 |
3 | FitCoach: Fitness Coach & Diet | Updated review. The app is good, the range of ... | 4 |
4 | FitCoach: Fitness Coach & Diet | Was alright in the beginning. You can't change... | 2 |
4. Data exploration
dataset_raw.info
<bound method DataFrame.info of app \
0 FitCoach: Fitness Coach & Diet
1 FitCoach: Fitness Coach & Diet
2 FitCoach: Fitness Coach & Diet
3 FitCoach: Fitness Coach & Diet
4 FitCoach: Fitness Coach & Diet
... ...
34315 8fit Workouts & Meal Planner
34316 8fit Workouts & Meal Planner
34317 8fit Workouts & Meal Planner
34318 8fit Workouts & Meal Planner
34319 8fit Workouts & Meal Planner
review rating
0 Its a nice app but so many features don't actu... 3
1 Deceptive. Not at all as advertised. Annoying ... 1
2 Not the app shown in the Facebook ads. I signe... 1
3 Updated review. The app is good, the range of ... 4
4 Was alright in the beginning. You can't change... 2
... ... ...
34315 It's the perfect app for a perfect workout. 5
34316 good app. i like the meal plans and workouts. 5
34317 Easy and practical to use. Love the variety of... 5
34318 It helps ME and MY body 5
34319 Good app so far with the first exercise. 5
[34320 rows x 3 columns]>
5. Data preprocessing
1. Check for missing
dataset_raw.isnull().sum()
app 0
review 2
rating 0
dtype: int64
2. Remove missing values
# axis = 0: remove missing value's row
dataset = dataset_raw.dropna(axis = 0)
dataset.isnull().sum()
app 0
review 0
rating 0
dtype: int64
3. Load dictionary for preprocessing
stopword_list = pd.read_excel('/content/gdrive/MyDrive/NLP-final-project/stopword_list.xlsx')
stopword_list.tail()
 | stopword |
---|---|
147 | really |
148 | great |
149 | nice |
150 | like |
151 | love |
replace_list = pd.read_excel('/content/gdrive/MyDrive/NLP-final-project/replace_list.xlsx')
replace_list.head()
 | before_replacement | after_replacement |
---|---|---|
0 | cell phone | phone |
1 | smartphone | phone |
2 | iphone | phone |
3 | galaxy | phone |
4 | ipad | phone |
4. Word substitution
def replace_word(review):
    for i in range(len(replace_list['before_replacement'])):
        try:
            # Perform data replacement only when there is a word to be replaced
            if replace_list['before_replacement'][i] in review:
                review = review.replace(replace_list['before_replacement'][i], replace_list['after_replacement'][i])
        except Exception as e:
            print(f"Error: {e}")
    return review
dataset['review_prep'] = ''
review_replaced_list = []
for review in tqdm(dataset['review']):
    review_replaced = replace_word(str(review)).lower() #lower case
    review_replaced_list.append(review_replaced)
dataset['review_prep'] = review_replaced_list
dataset.head()
 | app | review | rating | review_prep |
---|---|---|---|---|
0 | FitCoach: Fitness Coach & Diet | Its a nice app but so many features don't actu... | 3 | its a nice application but so many features do... |
1 | FitCoach: Fitness Coach & Diet | Deceptive. Not at all as advertised. Annoying ... | 1 | deceptive. not at all as advertised. annoying ... |
2 | FitCoach: Fitness Coach & Diet | Not the app shown in the Facebook ads. I signe... | 1 | not the application shown in the facebook ads.... |
3 | FitCoach: Fitness Coach & Diet | Updated review. The app is good, the range of ... | 4 | updated review. the application is good, the r... |
4 | FitCoach: Fitness Coach & Diet | Was alright in the beginning. You can't change... | 2 | was alright in the beginning. you can't change... |
5. Remove non-English text.
review_removed = list(map(lambda review: re.sub('[^a-zA-Z ]', '', review), dataset['review_prep']))
dataset['review_prep'] = review_removed
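As a quick illustration of what this pattern keeps and drops, the sketch below applies the same regular expression to a made-up review. The `collapse_spaces` helper is an optional extra step that is not part of the original pipeline and is shown here only as an assumption about how leftover double spaces could be tidied.

```python
import re

def keep_letters_and_spaces(text):
    """Apply the same pattern as above: drop everything except a-z, A-Z and spaces."""
    return re.sub('[^a-zA-Z ]', '', text)

def collapse_spaces(text):
    """Optional extra step (an assumption, not in the original pipeline)."""
    return re.sub(' +', ' ', text).strip()

# Made-up sample review for illustration only.
sample = "don't like it!! 10/10 would not recommend"
cleaned = keep_letters_and_spaces(sample)

print(repr(cleaned))
print(repr(collapse_spaces(cleaned)))
```

Because apostrophes are removed along with digits and punctuation, contractions lose their apostrophe, which is why tokens such as "dont" show up among the topic keywords later on.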
6. Separation of data based on rating
Google Play Store ratings are on a 5-point scale. Therefore, in this project, reviews rated 4-5 were classified as positive and reviews rated 1-2 as negative. Reviews rated 3 fall into neither group. This separates the positive and negative experiences of using the service.
# Positive review (4, 5 out of 5)
review_pos = dataset[(4 == dataset['rating']) | (dataset['rating'] == 5)]['review_prep']
# Negative review (1, 2 out of 5)
review_neg = dataset[(1 == dataset['rating']) | (dataset['rating'] == 2)]['review_prep']
review_pos
3 updated review the application is good the ran...
13 great way to stay on track with lots of variet...
18 despite the negative reviews i read i find the...
20 update after posting review my next workout di...
21 i loved this application while using it it kic...
...
34315 its the perfect application for a perfect workout
34316 good application i like the meal plans and wor...
34317 easy and practical to use love the variety of ...
34318 it helps me and my body
34319 good application so far with the first exercise
Name: review_prep, Length: 21017, dtype: object
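The grouping rule can also be written as a small standalone function. The following is a minimal sketch in plain Python with hypothetical sample pairs, not the pandas code used in the project; it mainly makes explicit that 3-star reviews end up in neither group.

```python
def rating_group(rating):
    """Return 'pos' for 4-5, 'neg' for 1-2, and None for the neutral rating 3."""
    if rating in (4, 5):
        return 'pos'
    if rating in (1, 2):
        return 'neg'
    return None

# Hypothetical (rating, text) pairs for illustration only.
sample = [(5, "good application"), (3, "its ok"), (1, "charged twice"), (4, "easy to use")]

groups = {'pos': [], 'neg': []}
for rating, text in sample:
    group = rating_group(rating)
    if group is not None:
        groups[group].append(text)

print(groups)
```

In pandas, the same masks can be built with `dataset['rating'].isin([4, 5])` and `dataset['rating'].isin([1, 2])`.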
7. Tokenization
Nouns are key to understanding the context of a sentence and make frequent words easy to identify, so the analysis focuses on them. As a first step, each review is split into word tokens with nltk.word_tokenize.
review_tokenized_pos = list(map(lambda review: nltk.word_tokenize(review), review_pos))
review_tokenized_neg = list(map(lambda review: nltk.word_tokenize(review), review_neg))
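The tokenization above keeps every word. If a strictly noun-only view is wanted, one possible sketch is to filter part-of-speech tagged tokens by their tag. The `keep_nouns` helper and the hand-written tags below are illustrative assumptions; in practice the (token, tag) pairs would come from a tagger such as `nltk.pos_tag`.

```python
def keep_nouns(tagged_tokens):
    """Keep tokens whose Penn Treebank tag starts with 'NN' (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged_tokens if tag.startswith('NN')]

# Hand-written (token, tag) pairs for illustration; in practice they would
# come from a tagger such as nltk.pos_tag(nltk.word_tokenize(review)).
tagged = [
    ('the', 'DT'), ('meal', 'NN'), ('plans', 'NNS'), ('are', 'VBP'),
    ('easy', 'JJ'), ('to', 'TO'), ('follow', 'VB'),
]
nouns = keep_nouns(tagged)

print(nouns)
```

Keeping the filter separate from the tagger makes it easy to test and to swap in a different tagger or tag set.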
8. Remove stopwords
# Build the lookup set once instead of rebuilding a list for every token
stopword_set = set(stopword_list['stopword'])
def remove_stopword(tokens):
    review_removed_stopword = []
    for token in tokens:
        # When the number of characters in the token is 2 or more
        if 1 < len(token):
            # Include as review data for analytics only if the token is not a stopword
            if token not in stopword_set:
                review_removed_stopword.append(token)
    return review_removed_stopword
review_removed_stopword_pos = list(map(lambda tokens : remove_stopword(tokens), review_tokenized_pos))
review_removed_stopword_neg = list(map(lambda tokens : remove_stopword(tokens), review_tokenized_neg))
- Select a specific range of reviews
In general, the longer the review, the more likely it is to contain user feedback, such as user experience or technical issues. However, very long reviews can make it difficult to identify topics or extract features from the combinations of words in the review (Vasa et al., 2012). Therefore, in this project, only reviews with between 3 and 15 tokens remaining after stopword removal were used for analysis.
MIN_TOKEN_NUMBER = 3 # Min
MAX_TOKEN_NUMBER = 15 # Max
def select_review(review_removed_stopword):
    review_prep = []
    for tokens in review_removed_stopword:
        if MIN_TOKEN_NUMBER <= len(tokens) <= MAX_TOKEN_NUMBER:
            review_prep.append(tokens)
    return review_prep
review_prep_pos = select_review(review_removed_stopword_pos)
review_prep_neg = select_review(review_removed_stopword_neg)
- Check the preprocessing result
review_num_pos = len(review_prep_pos)
review_num_neg = len(review_prep_neg)
review_num_tot = review_num_pos + review_num_neg
print(f"Total: {review_num_tot}")
print(f"Positive Reviews: {review_num_pos}({(review_num_pos/review_num_tot)*100:.2f}%)")
print(f"Negative Reviews: {review_num_neg}({(review_num_neg/review_num_tot)*100:.2f}%)")
Total: 13167
Positive Reviews: 10306(78.27%)
Negative Reviews: 2861(21.73%)
6. LDA Topic Modeling
1. Hyperparameter settings
NUM_TOPICS = 10
# passes: The same concept as the epoch, determining the number of times to train the model with the entire corpus
PASSES = 15
2. Model training
def lda_modeling(review_prep):
    # Word encoding and frequency counting
    dictionary = corpora.Dictionary(review_prep)
    corpus = [dictionary.doc2bow(review) for review in review_prep]
    # LDA model training
    model = gensim.models.ldamodel.LdaModel(corpus,
                                            num_topics = NUM_TOPICS,
                                            id2word = dictionary,
                                            passes = PASSES)
    return model, corpus, dictionary
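To make the "word encoding and frequency counting" step concrete, the following pure-Python sketch mimics the kind of output a bag-of-words encoding produces: each document becomes a list of (token id, count) pairs. It is an illustration only, not gensim's implementation, and the ids gensim assigns to tokens may differ from the ones produced here. The helper names `build_vocab` and `to_bow` are hypothetical.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Encode one tokenized document as sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

# Made-up tokenized reviews for illustration only.
docs = [['workout', 'plan', 'workout'], ['meal', 'plan']]
vocab = build_vocab(docs)
corpus = [to_bow(doc, vocab) for doc in docs]

print(vocab)
print(corpus)
```

Word order is discarded in this representation; only how often each word occurs in each review is passed on to the LDA model.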
3. Word composition output function by topic
def print_topic_prop(topics, RATING):
    topic_values = []
    for topic in topics:
        topic_value = topic[1]
        topic_values.append(topic_value)
    topic_prop = pd.DataFrame({"topic_num" : list(range(1, NUM_TOPICS + 1)), "word_prop": topic_values})
    topic_prop.to_excel('/content/gdrive/MyDrive/NLP-final-project/result/topic_prop_' + RATING + '.xlsx')
    display(topic_prop)
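The word_prop column stores each topic as a single string of the form weight*"word" + weight*"word". For further analysis it can be convenient to split that string into (word, weight) pairs. The helper below is a hypothetical sketch based only on the string format visible in the output tables.

```python
import re

def parse_topic_terms(word_prop):
    """Turn '0.094*"workout" + 0.066*"best"' into [('workout', 0.094), ('best', 0.066)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', word_prop)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic_terms('0.122*"use" + 0.114*"easy" + 0.028*"simple"')

print(terms)
```

With the pairs in hand, the keywords of different topics can be compared or filtered by weight without reading the raw strings.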
4. Visualization function
def lda_visualize(model, corpus, dictionary, RATING):
    pyLDAvis.enable_notebook()
    result_visualized = pyLDAvis.gensim.prepare(model, corpus, dictionary)
    # Wrap in display() so the chart renders even though it is created inside a function
    display(pyLDAvis.display(result_visualized))
    # Save result
    RESULT_FILE = '/content/gdrive/MyDrive/NLP-final-project/result/lda_result_' + RATING + '.html'
    pyLDAvis.save_html(result_visualized, RESULT_FILE)
5. Modeling positive review topics
Using the model training function, the per-topic word composition output function, and the visualization function defined above, I train and visualize a topic model separately for the positive reviews and the negative reviews. Here, each topic is summarized by its top 10 constituent words (=NUM_WORDS).
model, corpus, dictionary = lda_modeling(review_prep_pos)
NUM_WORDS = 10
RATING = 'pos'
topics = model.print_topics(num_words = NUM_WORDS)
print_topic_prop(topics, RATING)
 | topic_num | word_prop |
---|---|---|
0 | 1 | 0.094*"workout" + 0.066*"best" + 0.026*"fitnes... |
1 | 2 | 0.122*"use" + 0.114*"easy" + 0.028*"simple" + ... |
2 | 3 | 0.044*"weight" + 0.039*"add" + 0.020*"training... |
3 | 4 | 0.056*"easy" + 0.046*"workouts" + 0.024*"amazi... |
4 | 5 | 0.045*"workouts" + 0.028*"meal" + 0.025*"time"... |
5 | 6 | 0.045*"workouts" + 0.043*"free" + 0.029*"worko... |
6 | 7 | 0.022*"workout" + 0.022*"exercises" + 0.020*"g... |
7 | 8 | 0.019*"get" + 0.013*"recipes" + 0.013*"music" ... |
8 | 9 | 0.021*"fit" + 0.020*"update" + 0.016*"wish" + ... |
9 | 10 | 0.051*"track" + 0.038*"keep" + 0.023*"keeps" +... |
lda_visualize(model, corpus, dictionary, RATING)
6. Modeling negative review topics
model, corpus, dictionary = lda_modeling(review_prep_neg)
NUM_WORDS = 10
RATING = 'neg'
topics = model.print_topics(num_words = NUM_WORDS)
print_topic_prop(topics, RATING)
 | topic_num | word_prop |
---|---|---|
0 | 1 | 0.020*"year" + 0.013*"lost" + 0.011*"app" + 0.... |
1 | 2 | 0.026*"sign" + 0.026*"google" + 0.015*"keeps" ... |
2 | 3 | 0.028*"charged" + 0.027*"subscription" + 0.022... |
3 | 4 | 0.020*"steps" + 0.016*"working" + 0.016*"data"... |
4 | 5 | 0.036*"steps" + 0.015*"track" + 0.014*"count" ... |
5 | 6 | 0.023*"use" + 0.017*"account" + 0.016*"dont" +... |
6 | 7 | 0.039*"pay" + 0.022*"free" + 0.019*"money" + 0... |
7 | 8 | 0.018*"keeps" + 0.013*"plan" + 0.013*"used" + ... |
8 | 9 | 0.024*"update" + 0.019*"use" + 0.018*"time" + ... |
9 | 10 | 0.035*"free" + 0.026*"download" + 0.022*"get" ... |
6. How to interpret the results?
LDA topic modeling reports which keywords make up each topic and in what proportion. In other words, the analyst has to infer the specific content of each topic from its keywords. For example, a topic consisting of keywords such as ‘workouts’, ‘progress’, ‘easy’, and ‘plans’ is most likely related to an ‘exercise record’ feature. It is therefore important in LDA topic modeling to identify which keywords are in each topic and in what proportion. With this in mind, I discuss how to interpret the data visualized through pyLDAvis effectively.
1. Relevance
The relevance weight (λ) can be adjusted with the slider on the upper right of the figure below. It is a parameter that balances how frequently a word occurs within a topic against how frequently it occurs in the entire corpus. That is, when a word appears frequently in a specific topic, relevance helps to distinguish whether this is because the word is a keyword that sets the topic apart from other topics, or simply because it is widely used across all the documents.
Figure 1. Review topic modeling visualization results of positive evaluation.
The relevance weight λ lies between 0 and 1. The closer it is to 0, the more the ranking favors words that distinguish the topic from other topics, even if they occur rarely in the entire corpus. Conversely, the closer λ is to 1, the more the ranking favors words that are frequent within the topic, which are often words that also appear frequently across the entire corpus rather than keywords specific to that topic. For example, in healthcare app review data, the word ‘exercise’ is very likely to appear in many reviews, so it is difficult to distinguish one topic from another using the word ‘exercise’ alone. In such cases, setting λ close to 0 penalizes the word ‘exercise’, reflecting that it is widely used throughout the corpus rather than concentrated in that topic. In the user study by Sievert & Shirley (2014), a λ value of about 0.6 was the most effective. However, this value is not always the best choice, because the optimal λ may differ depending on the research domain, dataset, and so on.
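Sievert & Shirley (2014) define the relevance of a word w to a topic k as λ · log p(w | k) + (1 − λ) · log(p(w | k) / p(w)). The sketch below evaluates this formula for two words at different values of λ. The probabilities are hypothetical numbers chosen only to illustrate the effect, not values from this project's models.

```python
import math

def relevance(p_word_given_topic, p_word, lam):
    """lam * log p(w|k) + (1 - lam) * log(p(w|k) / p(w))."""
    return (lam * math.log(p_word_given_topic)
            + (1 - lam) * math.log(p_word_given_topic / p_word))

# Hypothetical probabilities, chosen only to illustrate the effect of lam:
# 'exercise' is common in the topic but also common everywhere;
# 'refund' is rarer in the topic but almost exclusive to it.
words = {
    'exercise': {'p_topic': 0.050, 'p_corpus': 0.040},
    'refund':   {'p_topic': 0.020, 'p_corpus': 0.003},
}

def rank(lam):
    """Return the words sorted from most to least relevant for a given lam."""
    scores = {w: relevance(v['p_topic'], v['p_corpus'], lam) for w, v in words.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(rank(1.0))  # ranking by in-topic probability only
print(rank(0.0))  # ranking by distinctiveness (lift) only
```

With λ = 1 the widely used word ranks first, and with λ = 0 the topic-specific word ranks first, which matches the behavior of the slider described above.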
2. Topics and keywords
Each circle on the left of the figure below represents one topic. The distance between circles indicates how similar the topics are to each other, and a larger circle means that the topic accounts for a larger share of the words (=tokens) in the corpus. When you hover the mouse over a circle, the keywords of that topic are listed on the right, along with how often each keyword occurs within the topic compared with the entire document data. The visualization also shows the share of the words in the entire document data that the topic accounts for. By identifying which words make up each topic and in what proportion, the theme of each topic can be inferred, as well as which topics make up the entire document data and in what proportion (= importance).
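One way to think about a topic's share of the corpus is to weight each document's topic mixture by the document's length and then normalize. The sketch below does this for hypothetical values; it is a rough illustration of the idea under that assumption, and pyLDAvis's exact computation may differ. The function name `topic_token_share` is made up for this example.

```python
def topic_token_share(doc_topic_dists, doc_lengths):
    """Weight each document's topic mixture by its length and normalize."""
    num_topics = len(doc_topic_dists[0])
    totals = [0.0] * num_topics
    for mixture, length in zip(doc_topic_dists, doc_lengths):
        for k, prob in enumerate(mixture):
            totals[k] += prob * length
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

# Hypothetical values: three reviews, two topics.
doc_topic_dists = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
doc_lengths = [10, 5, 5]
shares = topic_token_share(doc_topic_dists, doc_lengths)

print(shares)
```

A topic that dominates long reviews therefore gets a larger share, and a larger circle, than one that appears only in a few short reviews.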
Figure 2. Review topic modeling visualization results of Negative evaluation
7. Insights
This section analyzes user needs based on the visualization results.
1. Positive review analysis
First, the results of topic modeling of positive reviews are as follows.
- Exercise action description function
In topic 2, the words ‘useful’, ‘accurate’, ‘easy’, ‘simple’, ‘steps’, and ‘follow’ appear frequently. This suggests that content explaining exercise movements step by step received positive reviews. Therefore, when planning a healthcare app service, video-based lessons that demonstrate exercise movements are worth considering.
- Meal record feature
In topic 5, words such as ‘meal’, ‘track’, ‘schedule’, and ‘plans’ appeared frequently. This suggests that a meal record function, such as taking a picture of a meal and saving it, had a positive effect on managing diet. Computer vision technology can analyze what food was eaten and how much from food photos. Such technology is expected to provide a positive user experience in terms of convenience by removing the hassle of recording dietary information item by item.
- Exercise record feature
In topic 10, words such as ‘workouts’, ‘progress’, ‘motivated’, ‘steps’, ‘track’, and ‘helps’ appeared frequently. This suggests that planning exercise through an exercise log and checking the amount of exercise through measurement had a positive effect on interest in and motivation for exercise. Adding an exercise log feature to a healthcare app, one that lets users record the amount and type of exercise, is therefore expected to promote both regular exercise and continued use of the app.
2. Negative review analysis
- Automatic paid subscription complaint
In topic 3, words such as ‘charged’, ‘refund’, ‘subscription’, ‘trial’, ‘cancel’, and ‘free’ appear frequently. These complaints stem from the way some healthcare apps operate: they provide the service free for the first 1 to 3 months and then switch to a paid subscription after the trial period without asking for the user’s additional consent. Many reviews about this policy request immediate cancellation of the subscription and a refund. Therefore, a healthcare app should ask the user again before switching to a paid subscription, and should charge and provide the paid service only when the user agrees.
- Exercise tracking accuracy issue
In topic 5, the words ‘steps’, ‘track’, ‘count’, ‘accurate’, ‘gps’, ‘distance’, ‘mile’, ‘work’, and ‘wearable’ appear frequently. There is a lot of negative feedback about accuracy in exercise tracking, such as step counts. For example, the app reports 10,000 steps while the user’s smartwatch counts only 5,000. Low tracking accuracy of this kind may lead users to distrust the service as a whole. Therefore, when designing a healthcare service, it is necessary to continuously improve the accuracy of exercise tracking not only in the app but also on wearable devices.