Analysing WhatsApp messages with Python and Tableau

I was scrolling through the WhatsApp messages on my phone, trying to recollect what I had done over the last two years. Why scroll through WhatsApp to remember that? Because my WhatsApp contained a story: a story of my life in grad school(s), of all the connections I made, of those still connected and of those who lost contact. Mass forwards and spam aside, WhatsApp, and especially WhatsApp groups, have played a big part in helping me stay in touch with family, high school and undergrad friends, coordinate with project teammates and plan trips, among other things.

I wanted to analyze and visualize these conversations, and the first step was to obtain the messages as text files. The only option was to email myself the conversation files (.txt) from the app, since the backup files are all encrypted. After choosing the ‘without media’ option, I emailed myself some conversations and group chats from my WhatsApp chat list. (See the WhatsApp chat instructions for how to email a conversation.)

Merge the .txt files using Python

If you want to analyze a group chat, you will export a single file. If you want to analyze texts from several individual users, however, you will have multiple files, and we need to merge them before parsing.

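from os import listdir
from os.path import isfile, join

# "path" is a placeholder for the folder that holds the exported .txt files
onlyfiles = [f for f in listdir("path")
             if isfile(join("path", f)) and f != "FinalFile.txt"]

# Open the output file once and append the text of every chat file to it
with open(join("path", "FinalFile.txt"), "a", encoding="utf8") as w:
    for file in onlyfiles:
        with open(join("path", file), "r", encoding="utf8") as f:
            w.write(f.read())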

listdir() returns a list containing the names of the entries in the directory given by path. The list comprehension keeps only the regular files (skipping sub-directories) and stores them in onlyfiles. The for loop then reads each file in turn and appends its text to FinalFile.txt.
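
As a variation, the same merge can be sketched with pathlib, restricting the input to .txt files and skipping the output file so that re-running the script does not fold FinalFile.txt into itself. The function name merge_chats and its arguments are my own for illustration, not part of any library.

```python
from pathlib import Path

def merge_chats(folder, output_name="FinalFile.txt"):
    """Concatenate every .txt file in `folder` into `output_name` (hypothetical helper)."""
    folder = Path(folder)
    output = folder / output_name
    # Sort for a reproducible order; skip the output file itself
    sources = sorted(p for p in folder.glob("*.txt") if p.name != output_name)
    with output.open("w", encoding="utf8") as out:
        for src in sources:
            text = src.read_text(encoding="utf8")
            out.write(text)
            # Make sure the next chat starts on a new line
            if text and not text.endswith("\n"):
                out.write("\n")
    return output
```

Unlike the append mode used above, this version opens the output in write mode, so every run rebuilds the merged file from scratch.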

Parsing the text file

We will use the re, pandas and numpy Python packages for the analysis. (You will find more info in the pandas and numpy documentation. Both packages are included if you are using the Anaconda distribution.)

Each line in the WhatsApp conversation text file has the date and time, the sender and then the message. Depending on your phone, app version, location and phone settings, details such as the date separators can differ. For the Android device that I am using, the text looks like this:

8/20/15, 1:30 PM - Yashodhan: How is California? Done with your orientation?

8/20/15, 1:34 PM - Sahana: Not at all settled in. I over slept and bunked orientation. Lol

We need to parse the text file to get the information into a useful format. Using a regular expression, each message is extracted as a (DateTime, Sender, Message) tuple and the tuples are collected in a list. The list is converted to a pandas DataFrame, which we save as a CSV file with to_csv.

import re
import pandas as pd
import numpy as np

with open('path\\to\\file\\FinalFile.txt', 'r', encoding="utf8") as f:
    text = f.read()

# Feed the file text into findall(); it returns a list of (DateTime, Sender, Message) tuples.
# The pattern follows the sample lines above (8/20/15, 1:30 PM - Sender: text);
# adjust the date separators if your export looks different.
strings = re.findall(r'(\d+/\d+/\d+,\s+\d+:\d+\s+\w+)\s+-\s(\w+\s\w+\s\w+|\w+\s\w+|\w+):(.*)', text)

# Convert the list to a DataFrame and name the columns
MsgTable = pd.DataFrame(strings, columns=['DateTime', 'Sender', 'Message'])
MsgTable.to_csv('path\\to\\file', index=False, header=False)
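
One limitation of running findall() over the whole text is that a message containing line breaks loses everything after its first line, because the continuation lines carry no timestamp. A minimal sketch of a line-by-line parser that appends such lines to the previous message is shown here. It assumes the Android format from the sample above, and parse_chat, LINE_RE and STAMP_RE are names I made up for illustration.

```python
import re
import pandas as pd

# Assumed format: "8/20/15, 1:30 PM - Sender: message"
LINE_RE = re.compile(r'^(\d+/\d+/\d+, \d+:\d+ [AP]M) - ([^:]+): (.*)$')
# Any line that starts with a timestamp, with or without a sender
STAMP_RE = re.compile(r'^\d+/\d+/\d+, \d+:\d+ [AP]M - ')

def parse_chat(text):
    """Return a DataFrame with DateTime, Sender and Message columns."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rows.append(list(match.groups()))
        elif rows and line.strip() and not STAMP_RE.match(line):
            # No timestamp: treat the line as a continuation of the previous message
            rows[-1][2] += ' ' + line.strip()
        # Timestamped lines without a sender (system notices) are skipped
    table = pd.DataFrame(rows, columns=['DateTime', 'Sender', 'Message'])
    # Convert the timestamp text into real datetimes for later grouping
    table['DateTime'] = pd.to_datetime(table['DateTime'], format='%m/%d/%y, %I:%M %p')
    return table

# The second message is split across two lines here purely for illustration
sample = (
    "8/20/15, 1:30 PM - Yashodhan: How is California? Done with your orientation?\n"
    "8/20/15, 1:34 PM - Sahana: Not at all settled in.\n"
    "I over slept and bunked orientation. Lol\n"
)
chat = parse_chat(sample)
print(chat)
```

Converting DateTime to real timestamps at this stage is optional, but it makes grouping by hour, weekday or week straightforward later on.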

The next step is to analyze the text content. We convert the Message column of the DataFrame to a list, MsgList, in which every message is one item. Since we are interested in word usage, we join the list of strings into a single string, convert it to lower case and call .split(), which splits the string into a list of individual words.

MsgList = MsgTable['Message'].tolist()
WordList = ' '.join(MsgList).lower().split()

Using Counter we count the frequency of words in the list and add the word frequency list into a DataFrame.

from collections import Counter
df = pd.DataFrame.from_dict(dict(Counter(WordList)), orient='index').reset_index()
df.columns =['Word','Count']
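
If only the top few words are needed, Counter can also return them directly through most_common(), without going through a DataFrame first. A small sketch with a made-up message list:

```python
from collections import Counter

# Made-up messages standing in for MsgTable['Message'].tolist()
messages = ["Lol that was fun", "lol", "That trip was fun"]
words = ' '.join(messages).lower().split()

counts = Counter(words)
# most_common(n) returns the n most frequent (word, count) pairs
top_words = counts.most_common(2)
print(top_words)
```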

Analyze the text

Since we are only interested in the words, we use regular expressions to eliminate hyperlinks, digits and all other non-alphabetical characters from the DataFrame. WhatsApp txt files show <media omitted> in place of media files, so you probably don’t want to count the words media and omitted. You may also not want to count articles, prepositions and other common words. All words that we do not want to count are added to the list DropWords. Stripping characters leaves some entries as empty strings, which is why '' is in DropWords too. Using numpy we replace every entry found in DropWords with NaN and then drop the rows containing NaN, leaving a clean list of the words whose frequency we are interested in. The DataFrame is then sorted by count and saved as a CSV file containing the words and their frequencies.

# Remove hyperlinks first, then digits and any other non-word characters
df['Word'] = df['Word'].replace(to_replace=r'http\S*|\d|\W', value='', regex=True)
DropWords = ['', 'a', 'and', 'u', 'to', 'for', 'with', 'of', 'in', 'media', 'omitted', 'image']
# Mark unwanted words as NaN, then drop those rows
df['Word'] = df['Word'].replace(DropWords, np.nan)
df = df.dropna(subset=['Word'])
df = df.sort_values(by='Count', ascending=False)
df.to_csv('Path\\to\\file\\Word_Count.csv', index=False, header=False)
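
One thing to watch for: because the counting happens before the cleaning, variants such as “lol” and “lol!” are counted separately and end up as two rows with the same word. A minimal sketch of merging them with groupby(), using a small made-up DataFrame in place of df:

```python
import pandas as pd

# Made-up counts as they might look before cleaning
df = pd.DataFrame({'Word': ['lol', 'lol!', 'fun', '123', 'fun.'],
                   'Count': [5, 2, 3, 1, 1]})

# Same idea as above: strip digits and non-word characters
df['Word'] = df['Word'].replace(to_replace=r'\d|\W', value='', regex=True)
df = df[df['Word'] != '']

# Add up the counts of rows that now share the same word
merged = (df.groupby('Word', as_index=False)['Count'].sum()
            .sort_values('Count', ascending=False)
            .reset_index(drop=True))
print(merged)
```

An alternative would be to clean WordList before handing it to Counter, which avoids the duplicate rows in the first place.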

The response matrix

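When analyzing a group chat, one thing we can look at is who responds to whom and how often. We can capture this in a response matrix: every time a message is immediately followed by a message from a different sender, we count one response from the second sender to the first. Consecutive messages from the same sender (responses to self) are ignored. The output is again saved as a CSV file.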

MsgResponse = MsgTable['Sender']
labels = MsgTable['Sender'].unique()
response = pd.DataFrame(0, index=labels, columns=labels)
# Column = sender of a message, row = sender of the response that follows it
for i in range(len(MsgResponse) - 1):
    if MsgResponse.iloc[i] != MsgResponse.iloc[i+1]:
        response.loc[MsgResponse.iloc[i+1], MsgResponse.iloc[i]] += 1
response.to_csv('Path\\Response.csv', index=True, header=True)
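
The same matrix can also be sketched without an explicit loop, by pairing each sender with the sender of the next message via shift() and tabulating the pairs with pd.crosstab(). The sender sequence here is made up. Rows are responders and columns are the senders they responded to, matching the loop above.

```python
import pandas as pd

# Made-up sender sequence standing in for MsgTable['Sender']
senders = pd.Series(['Ryan', 'Scott', 'Scott', 'Ryan', 'Asha', 'Ryan'])

pairs = pd.DataFrame({'Sender': senders, 'Responder': senders.shift(-1)})
# Drop the last message (nobody follows it) and responses to self
pairs = pairs.dropna()
pairs = pairs[pairs['Sender'] != pairs['Responder']]

# Rows = responder, columns = sender of the preceding message
labels = senders.unique()
response = (pd.crosstab(pairs['Responder'], pairs['Sender'])
              .reindex(index=labels, columns=labels, fill_value=0))
print(response)
```

The reindex() call keeps people who never responded (or were never responded to) in the matrix as rows and columns of zeros.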

 

blog007

The response matrix – Row = Sender, Col = Response

Visualizing the WhatsApp Chats

I used Tableau to visualize the data from the CSV files. I wanted to know at what time of day I usually message and how I have stayed in contact with people over WhatsApp. The following two charts show these visualizations.

blog001

Messaging activity by the hour (sent and received, 12 individuals + me)

blog002

Messages received plotted by the week (Aug 2014 to June 2016, from 12 individuals)
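
These charts were built in Tableau directly from the CSV files. If you would rather pre-aggregate in pandas before plotting, a rough sketch of counting messages per hour of the day and per week could look like this (the timestamps are made up, and the column name follows MsgTable):

```python
import pandas as pd

# Made-up timestamps in the same text format as the export
MsgTable = pd.DataFrame({'DateTime': ['8/20/15, 1:30 PM', '8/20/15, 1:34 PM',
                                      '8/21/15, 10:05 PM', '8/30/15, 9:15 AM']})
stamps = pd.to_datetime(MsgTable['DateTime'], format='%m/%d/%y, %I:%M %p')

# Messages per hour of the day (0-23)
per_hour = stamps.dt.hour.value_counts().sort_index()

# Messages per calendar week (weekly periods run Monday to Sunday by default)
per_week = stamps.dt.to_period('W').value_counts().sort_index()

print(per_hour)
print(per_week)
```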

Analysis of a Group Chat

I also analyzed conversations from a group for similar parameters: the activity during the day, messages sent and received, who responds to whom and so on. We can see that the most active user is Ryan (actual names were replaced), the group activity peaks at 10 P.M. and most messages are exchanged in the group on Tuesdays. No surprises: the most used word was “LOL”.

blog009

Word Frequency – Top 15 Words

blog004

The number of messages sent on the group

blog006

The activity by the hour. The group members were spread across 5 time zones in India, Europe and the USA.

 

The chart below shows the activity of users on the group over two years. The gaps show no activity and the colored regions show days on which messages were sent by the particular users.

blog005

The activity of the users on the group over two years.

 

blog003

Usage by day of the week (1 – Sunday, 2 – Monday, …, 7 – Saturday)
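
Note that pandas numbers weekdays differently from the chart above: dt.dayofweek runs from 0 (Monday) to 6 (Sunday). A small sketch of converting to the 1 = Sunday … 7 = Saturday convention, using made-up dates:

```python
import pandas as pd

# 2015-08-16 was a Sunday, so these dates run Sunday through Saturday
dates = pd.Series(pd.date_range('2015-08-16', periods=7, freq='D'))

# dayofweek: Monday=0 ... Sunday=6; shift so that Sunday=1 ... Saturday=7
day_number = (dates.dt.dayofweek + 1) % 7 + 1
print(day_number.tolist())
```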

 

Visualizing the response matrix

I created a dashboard in Tableau to visualize the response matrix. With the dashboard we can see, out of all the responses a person got, who responded the most, and so on.

blog008

The dashboard screenshot shows the breakdown of responses Scott got
