Breaking down the ‘Interview of the Decade’

(If you want to jump straight to the analysis, the notebook is on GitHub.)

“Ma’am, the first time you met Indira Gandhi… do you remember it?” Rajdeep Sardesai asked Sonia. It was not an odd question to open with; after all, they were at Indira Gandhi’s birthplace, ‘Swaraj Bhawan’, during her centenary celebrations. As the interview progressed it became clear that it was going to be all about Indira, more like a daughter-in-law speaking about her mother-in-law. There was no big breaking news in the “Interview of the Decade”.

When Rajdeep attempted to ask a question on vote-bank politics, she curtly reminded him: “No politics… Mrs Gandhi only today.” “Fair enough,” said Rajdeep, making no attempt to counter. It was clear there was no room for ‘uncomfortable’ questions. The Twitterati, as usual, were quick to point out Rajdeep’s previous position on interviews!

I think Rajdeep has a lot of questions to answer about the questions he did not ask! I believe the bigger story lies in what was avoided. But let’s leave that for the political analysts and journalists to decide. Let’s break down what we have: the interview.

All analysis was based on the interview transcript from the India Today website (Interview Link). The analysis was done in Python using Pandas and NLTK. The entire code and data can be found here: GitHub.

The 37-minute interview was similar in duration to other interviews Rajdeep has done. Does Rajdeep speak for longer than the person he is interviewing? Probably not as much as other journalists, but he used more words than Sonia Gandhi in this interview. The graph below shows the cumulative word count for Rajdeep and Sonia.


Cumulative Word count vs Time
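
A running total like this can be computed from a transcript table with one row per utterance. The snippet below is only a minimal sketch and not the notebook's actual code: the column names speaker and text are assumptions, and the three rows are lines quoted in this post rather than the full transcript.

```python
import pandas as pd

# Placeholder transcript: one row per utterance, in speaking order.
# The quotes come from this post; the column names are assumptions.
transcript = pd.DataFrame({
    "speaker": ["Sonia", "Rajdeep", "Sonia"],
    "text": [
        "No politics... Mrs Gandhi only today",
        "Fair enough",
        "I don't agree with that",
    ],
})

# Words per utterance, then a running total kept separately for each speaker
transcript["words"] = transcript["text"].str.split().str.len()
transcript["cumulative_words"] = transcript.groupby("speaker")["words"].cumsum()
print(transcript[["speaker", "words", "cumulative_words"]])
```

Plotting cumulative_words against each utterance's timestamp, one line per speaker, would give a chart of the same shape.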

Sonia Gandhi spoke 88 times during the interview and used over 2000 words, averaging about 26 words per answer (14 words per answer if we exclude stop words like “then”, “that”, etc.). There were two questions on which Sonia had a lot to say; she took more than a minute to answer each of them. No surprise that both questions were about Indira. The first was about her first interaction with Indira Gandhi and the second was about the time of Indira’s death. She appeared emotional while answering both.
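
To make the “with and without stop words” averages concrete, here is a rough sketch of the counting. The stop-word set is a tiny hand-picked stand-in (a real analysis would use a fuller list such as NLTK's), and the two answers are quotes from this post, so the numbers it prints are not the interview's averages.

```python
# Tiny illustrative stop-word set; an assumption, not the list used in the analysis
STOP_WORDS = {"i", "with", "that", "then", "the", "a", "to", "of"}

def count_words(answer, drop_stop_words=False):
    """Count whitespace-separated words, optionally skipping stop words."""
    words = [w.strip(".,!?\"'").lower() for w in answer.split()]
    words = [w for w in words if w]  # discard tokens that were pure punctuation
    if drop_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return len(words)

answers = ["I don't agree with that", "No politics... Mrs Gandhi only today"]
all_counts = [count_words(a) for a in answers]
content_counts = [count_words(a, drop_stop_words=True) for a in answers]

# Average words per answer, with and without stop words
print(sum(all_counts) / len(answers), sum(content_counts) / len(answers))
```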

Did the length of the question have anything to do with the length of her answer? In some cases the questions could be answered with only a ‘Yes’ or ‘No’, or perhaps Sonia chose to answer with just that. The scatter plots below show the number of words in Rajdeep’s questions vs the number of words in Sonia’s answers. Most of the interactions appear to be under 20 words each.


‘Gandhi’ was the most used word in the interview. Not that Gandhi, obviously! It was all about the “Saas, Bahu and Pariwar”.

Word cloud for all the words in the interview, and individual word clouds for Sonia and Rajdeep. (All word clouds in the blog post were created in Python with WordCloud.)



Word clouds for Rajdeep and Sonia Gandhi

It seemed like she answered a lot of questions with really short answers. The histogram below shows the distribution of the number of words used in each answer (after removing stop words). Out of the 88 times she spoke, she answered 17 times using 3 words or fewer (21 times if we ignore stop words). Most of her answers were under 10 words!

Even if you watched the interview without paying complete attention, you would have noticed she answered a lot of questions with “absolutely”. She used the term 14 times!


Sonia Gandhi’s agenda was clear: to not let the interview move in any direction other than Indira Gandhi. Understandably, she refused to say the name “Modi”, even after Rajdeep Sardesai brought it up 4 times. When asked if the Congress needed a leader like Indira to take on Modi, Sonia replied, “I don’t agree with that.”

Rajiv, Rahul and Priyanka found 6, 3 and 2 mentions respectively.

Number of times Corruption, Office of Power, Shadow Prime Minister, Bofors, National Herald, Defense, Terrorism, Policy etc was mentioned? No surprise there, ZERO!
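
Mention counts like these come down to whole-word searches over the transcript. The helper below is a hedged sketch: count_mentions is a hypothetical name, and the sample text is stitched together from lines quoted in this post, not the real transcript.

```python
import re
from collections import Counter

def count_mentions(text, terms):
    """Count case-insensitive whole-word (or whole-phrase) matches for each term."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Placeholder text built from lines quoted in this post, not the full transcript
sample = (
    "No politics... Mrs Gandhi only today. "
    "The first time you met Indira Gandhi... do you remember it?"
)
mentions = count_mentions(sample, ["Gandhi", "Modi", "National Herald"])
print(mentions)
```

The word boundaries keep a short term from matching inside a longer word, which matters for names that are substrings of other words.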

Should journalists be more open about what happens behind the scenes? Should journalists hold themselves to higher moral standards? Did it deserve to be called the “Interview of the Decade”? One of Rajdeep’s recent tweets probably has the answer: “yeh public hai yeh Sab jaanti hai, andar Kya hai, Bahar Kya hai, Sab pehanchati hai!” (The public knows and understands everything!)


Tour de France

The Tour de France is undoubtedly the most prestigious of the many cycling races held across the world. I was blissfully unaware of the magnitude and scale of this event; in fact, I had little knowledge of it beyond the Lance Armstrong doping scandal and the occasional posts that popped up on my social media news feed. I was surprised to find out that the race is over 100 years old! The first race was held in 1903 to increase the sales of a newspaper in France. What started off as a marketing gimmick grew to become one of the most followed sporting events around the globe.

My cousin happens to be a huge fan of the sport, and while visiting him not only did I end up watching multiple stages of the 2016 race, but dinner-table conversations also revolved around it. The more interested I got, the more questions started popping into my head. Was it the human or the bicycle or a combination of both? Had better equipment made cycling longer distances possible? Were people cycling faster?

Since the first race in 1903, the race has been held every year except during the two World Wars. A lot has changed since then: the speed, the technology, the duration and the fan base. The number of participants has gone up since the race started, and so has the percentage of people who finish the race. The longest race was held in 1926, covering 5745 km over 39 days. Except for 2003, the race distance has been under 3600 km for the last 10 years.

Finding the data

While hunting for race statistics I came across this site ( ) which has some great Tour de France data. I used Python with Pandas, NumPy and Matplotlib for parsing, cleaning, editing and plotting. The first step was to extract the tables from the webpage. pd.read_html returns a list of tables, and the required table was picked from that list. After that the data had to be cleaned and normalized before plotting. All plots were created using Matplotlib.

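import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter/IPython magic; remove this line when running as a plain script
%matplotlib inline
# read_html returns a list with one DataFrame per table found on the page
tdf = pd.read_html('')
# The race statistics are in the third table of the list
tdf_final = tdf[2]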
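
The cleaning and normalizing step is not shown in the post. As a rough sketch of what it can involve, the snippet below turns text cells with units and thousands separators into numbers. The raw cell formats and the helper name to_number are assumptions; the figures are the ones quoted in this post.

```python
import pandas as pd

def to_number(series):
    """Strip thousands separators and units, then convert to floats (NaN if nothing numeric is left)."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    digits = cleaned.str.extract(r"(\d+(?:\.\d+)?)", expand=False)
    return pd.to_numeric(digits, errors="coerce")

# Assumed raw cell formats; "n/a" stands in for a missing value
lengths = pd.Series(["5,745 km", "n/a"])
speeds = pd.Series(["25.27 km/h", "39.64 km/h"])

n_length = to_number(lengths)
n_speed = to_number(speeds)
print(n_length.tolist(), n_speed.tolist())
```

Writing the results to new columns (for example ones prefixed with N_) keeps the original text columns available for checking.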


The scatter plot below shows a logical trend: the average speed decreases as the distance covered in a race increases. The outliers at the bottom left of the graph are the races from 1903-1905, which had both shorter distances and lower speeds.


The scatter plot for average speed of the winner vs Total distance



Race distance by the year



Race Distance and Speed by the year

The graph shows that the average speed of the winners has increased from 25.27 km/h in 1903 to 39.64 km/h in 2015. While the average speed has been increasing, the race distance has been decreasing. The last 10 years have seen race distances of under 3,600 km.
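
For a sense of scale, the two speeds quoted above work out to a gain of about 57%. The arithmetic, as a short sketch:

```python
speed_1903 = 25.27  # km/h, winner's average speed in 1903 (quoted above)
speed_2015 = 39.64  # km/h, winner's average speed in 2015 (quoted above)

increase = speed_2015 - speed_1903
percent_increase = 100 * increase / speed_1903
print(f"{increase:.2f} km/h faster, a {percent_increase:.1f}% increase")
```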


Race Duration(in Days) and Number of Stages by the year



Entrants and Finishers by the year

The number of participants in the race has been increasing. From 60 in 1903, the 2015 race saw 198 participants. The proportion of people who complete the race has also increased over the years.
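
The proportion of finishers is simply finishers divided by entrants. A minimal sketch follows; finish_rate is a hypothetical helper, and the numbers passed to it are arbitrary, not real race figures.

```python
def finish_rate(entrants, finishers):
    """Percentage of entrants who completed the race."""
    if entrants <= 0:
        raise ValueError("entrants must be positive")
    return 100 * finishers / entrants

# Arbitrary numbers to show the calculation, not real race figures
example_rate = finish_rate(entrants=200, finishers=150)
print(example_rate)
```

On a table with numeric entrant and finisher columns, the same ratio can be computed for every year at once by dividing one column by the other.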


This should be obvious, and the scatter plot agrees: races with a longer duration cover more distance.

We can obtain pair plots using pairplot in seaborn:

import seaborn as sns
# Use the selected table (tdf_final); tdf itself is the list returned by read_html
sns.pairplot(data=tdf_final[["N_Duration","N_Length","N_Entrant","N_Finished","N_Avg Speed"]], dropna=True)

The pair plot


3D Scatter Plot of Distance, Duration and Avg Speed of the winner

Analysing WhatsApp messages with Python and Tableau

I was scrolling through the WhatsApp messages sitting in my phone while trying to recollect the things I did over the last 2 years. Now why was I scrolling through WhatsApp, of all things, to recollect what I had done? Because my WhatsApp contained a story: a story of my life in grad school(s), of all the connections I made, of those still connected and of those who lost contact. Mass forwards and spam apart, WhatsApp, and especially WhatsApp groups, have played a big part in helping me stay in touch with family, high school and undergrad friends, coordinate with project teammates and plan trips, among other things.

I wanted to analyze and visualize these conversations, and the first step was to obtain the text files for the messages. The only option was to email myself the conversation files (.txt) from the app, since the backup files are all encrypted. After choosing the ‘without media’ option I emailed myself some conversations and group chats from my WhatsApp chat list. (Check this for email instructions: WhatsApp chat.)

Merge the .txt files using Python

If you want to analyze the chat from a group you will export a single file; however, if you want to analyze texts from different individual users you will have multiple files. We need to merge the files before parsing.

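from os import listdir
from os.path import isfile, join
# "path" is a placeholder for the folder that holds the exported .txt files
onlyfiles = [f for f in listdir("path") if isfile(join("path", f))]
# Append the contents of every exported file to FinalFile.txt
with open(join("path", "FinalFile.txt"), "a", encoding="utf8") as w:
    for file in onlyfiles:
        with open(join("path", file), "r", encoding="utf8") as f:
            w.write(f.read())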

The listdir() function returns a list containing the names of the entries in the directory given by path. From it, a list called onlyfiles is created that keeps only the regular files. The for loop goes through the list of file names, reads each file, and appends its text to FinalFile.txt.

Parsing the text file

We will use the re, Pandas and NumPy Python packages for the analysis. (You will find more info and documentation here: pandas and numpy. Both packages are included if you are using the Anaconda distribution: Anaconda.)

Each line in the WhatsApp conversation text file has the date and time, the sender and then the message. Depending on the phone you use, the app version, your location and your phone settings, the date separators etc. can be different. For the Android device that I am using, the text looks like this:

8/20/15, 1:30 PM - Yashodhan: How is California? Done with your orientation?

8/20/15, 1:34 PM - Sahana: Not at all settled in. I over slept and bunked orientation. Lol

We need to parse the text file in order to get the information in a useful format. Using a regular expression, each line is split into its parts and the results are collected in a list. Using pandas to_csv we save the list as a CSV file which contains DateTime, Sender, Message.

import re
import pandas as pd
import numpy as np

# Read the merged chat export
with open('path\\to\\file\\FinalFile.txt', 'r', encoding="utf8") as f:
    text = f.read()

# Feed the file text into findall(); it returns a list of (DateTime, Sender, Message) tuples.
# The date separator may be "." or "/" and the "." after AM/PM is optional, since exports differ.
strings = re.findall(r'(\d+[./]\d+[./]\d+,\s+\d+:\d+\s+\w+)\.?\s+-\s(\w+\s\w+\s\w+|\w+\s\w+|\w+):(.*)', text)

# Convert list to a dataframe and name columns
MsgTable = pd.DataFrame(strings, columns=['DateTime', 'Sender', 'Message'])
MsgTable.to_csv('path\\to\\file', index=False, header=False)
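
Before running a pattern over a whole export, it helps to try it on a couple of lines. The sketch below checks a tolerant variant of the pattern against the two sample lines shown earlier. The tolerance for "/" or "." as date separators and the optional "." after the AM/PM marker are assumptions about export formats, so adjust them to your own files.

```python
import re

# Tolerant variant of the pattern: date parts separated by "/" or ".",
# and an optional "." after the AM/PM marker (both are assumptions).
LINE_PATTERN = re.compile(
    r"(\d+[./]\d+[./]\d+,\s+\d+:\d+\s+\w+)\.?\s+-\s(\w+\s\w+\s\w+|\w+\s\w+|\w+):(.*)"
)

sample = (
    "8/20/15, 1:30 PM - Yashodhan: How is California? Done with your orientation?\n"
    "8/20/15, 1:34 PM - Sahana: Not at all settled in. I over slept and bunked orientation. Lol\n"
)

# findall() yields one (DateTime, Sender, Message) tuple per matching line
rows = [
    (when, sender, message.strip())
    for when, sender, message in LINE_PATTERN.findall(sample)
]
for row in rows:
    print(row)
```

If a line produces no tuple, the usual culprits are a different date separator, a 24-hour clock without AM/PM, or a sender name longer than three words.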

The next step is to analyze the text content. We convert the Message column from the DataFrame to a list, MsgList, in which every message is one item. But we are interested in word usage, so we join the list of strings into a single string and then convert all words to lower case. The .split() call splits the string into a list of individual words.

MsgList = MsgTable['Message'].tolist()
WordList = ' '.join(MsgList).lower().split()

Using Counter we count the frequency of words in the list and add the word frequency list into a DataFrame.

from collections import Counter
df = pd.DataFrame.from_dict(dict(Counter(WordList)), orient='index').reset_index()
df.columns =['Word','Count']

Analyze the text

Since we are only interested in the words, we use regular expressions to eliminate all non-alphabetical characters, hyperlinks etc. from the DataFrame. WhatsApp txt files show <media omitted> in place of media files, so you probably don’t want to count the words “media” and “omitted”. You also may not want to count articles and many other common words. All words that we do not want to count are added to the list DropWords. Stripping characters can also leave empty strings behind. Using NumPy we replace the empty strings and the DropWords with NaN and then drop the rows containing NaN, to get a clean list of the words whose frequency we are interested in knowing. The DataFrame is then saved as a CSV file containing the words and their frequencies.

DropWords = ['','a','and','u','to','for','with','of','in','omitted','image']
# Mark unwanted words as NaN, then drop those rows
df['Word'] = df['Word'].replace(DropWords, np.nan)
df.dropna(subset=['Word'], inplace=True)
df.sort_values(by=['Count'],axis=0, ascending=False, inplace=True)
df.to_csv('Path\\to\\file\\Word_Count.csv', index=False, header=False)
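
The regular-expression cleanup described above is not shown in the snippet, so here is one hedged way to do it with the standard library before counting. The two patterns and the clean_words helper are assumptions rather than the code used for the post, and the messages are the sample lines from earlier plus a media placeholder.

```python
import re
from collections import Counter

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")  # hyperlinks
NON_LETTERS = re.compile(r"[^a-z\s]")               # anything that is not a-z or whitespace

def clean_words(messages, drop_words):
    """Lower-case the messages, strip links and non-letters, and drop unwanted words."""
    words = []
    for message in messages:
        text = URL_PATTERN.sub(" ", message.lower())
        text = NON_LETTERS.sub(" ", text)
        words.extend(w for w in text.split() if w not in drop_words)
    return words

messages = [
    "How is California? Done with your orientation?",
    "<Media omitted>",
    "Not at all settled in. I over slept and bunked orientation. Lol",
]
drop = {"a", "and", "u", "to", "for", "with", "of", "in", "media", "omitted", "image"}
word_counts = Counter(clean_words(messages, drop))
print(word_counts.most_common(3))
```

One limitation of this sketch: replacing every non-letter with a space splits contractions, so "don't" would be counted as "don" and "t".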

The response matrix

When analyzing a group chat, one thing we can look at is who responds to whom and how often. We can construct a response matrix for this, ignoring responses to oneself. The output is again saved as CSV.

MsgResponse = MsgTable['Sender']
labels = MsgTable['Sender'].unique()
response = pd.DataFrame(0,index=labels,columns=labels)
#Column = Sender, row = response
i = 0
while i < (len(MsgResponse)-1):
    # Count a response only when the next message comes from a different sender
    if MsgResponse.iloc[i]!= MsgResponse.iloc[i+1]:
        response.loc[MsgResponse.iloc[i+1],MsgResponse.iloc[i]] += 1
    i += 1
response.to_csv('Path\\Response.csv', index=True, header=True)
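
The same counting rule can be tried on a short list of senders without pandas. This is a sketch of the idea rather than a replacement for the code above: count_responses is a hypothetical helper, and the sender sequence is made up using the placeholder names that appear later in this post.

```python
from collections import Counter

def count_responses(senders):
    """Count (responder, original sender) pairs for consecutive messages by different people."""
    pairs = Counter()
    for previous, current in zip(senders, senders[1:]):
        if current != previous:  # ignore people following up on their own message
            pairs[(current, previous)] += 1
    return pairs

# Made-up sender sequence, in the order the messages were sent
senders = ["Ryan", "Scott", "Scott", "Ryan", "Scott"]
responses = count_responses(senders)
print(responses)
```

Treating every change of sender as a response is a simplification: in a busy group, the next message is not always a reply to the previous one.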



The response matrix – Row = Sender, Col = Response

Visualizing the WhatsApp Chats

I used Tableau to visualize the data from the CSV files. I wanted to know what time of the day I usually message and how I have stayed in contact with my contacts over WhatsApp. The following two charts show both.


Messaging activity by the hour (sent and received, 12 individuals + me)


Messages received plotted by the week (Aug 2014 to June 2016, from 12 individuals)

Analysis of a Group Chat

I also analyzed conversations from a group for similar parameters: the activity during the day, messages sent and received, who responds to whom, etc. We can see that the most active user is Ryan (actual names were replaced), the group activity peaks at 10 P.M. and most messages are exchanged in the group on Tuesdays. No surprises: the most used word was “LOL”.


Word Frequency – Top 15 Words


The number of Messages Sent on the group


The activity by the hour. The group members were spread across 5 time zones in India, Europe and USA


The chart below shows the activity of users on the group over two years. The gaps show no activity and the colored regions show days on which messages were sent by the particular users.


The activity of the users on the group over two years.



Usage by Day of the week ( 1- Sunday, 2 – Monday …, 7-Saturday)
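
Hour-of-day and day-of-week values like the ones charted here can be derived from the DateTime strings. The charts in this post were built in Tableau, so the snippet below is only a sketch of the equivalent step in Python. It assumes timestamps in the month/day/year, 12-hour format of the sample lines shown earlier, and both function names are hypothetical.

```python
from datetime import datetime

def parse_timestamp(stamp):
    """Parse a timestamp written like '8/20/15, 1:30 PM' (month/day/year, 12-hour clock)."""
    return datetime.strptime(stamp, "%m/%d/%y, %I:%M %p")

def sunday_first_weekday(moment):
    """Day of week numbered 1 = Sunday ... 7 = Saturday."""
    # datetime.weekday() uses 0 = Monday ... 6 = Sunday, so shift by one day
    return (moment.weekday() + 1) % 7 + 1

moment = parse_timestamp("8/20/15, 1:30 PM")
hour = moment.hour                       # 24-hour clock
day_number = sunday_first_weekday(moment)
print(hour, day_number)
```

Exports from phones with other regional settings need a different format string, for example day before month or a 24-hour clock.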


Visualizing the response matrix

I created a dashboard in Tableau to visualize the response matrix. With the dashboard we can see, for all the responses a person got, who responded the most, etc.
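
The same breakdown can be read straight off the response matrix. The sketch below assumes the layout produced by the code that builds the matrix (columns are the original senders, rows are the people who responded). The counts are made up, "Maya" is a placeholder name, and responders_to is a hypothetical helper.

```python
import pandas as pd

# Made-up counts: columns = original sender, rows = person who responded
names = ["Ryan", "Scott", "Maya"]
response = pd.DataFrame(
    [[0, 4, 1],
     [3, 0, 2],
     [5, 6, 0]],
    index=names, columns=names,
)

def responders_to(matrix, person):
    """Responses received by `person`, largest first, with each responder's share in percent."""
    received = matrix[person].drop(person).sort_values(ascending=False)
    share = (100 * received / received.sum()).round(1)
    return pd.DataFrame({"responses": received, "share_pct": share})

scott = responders_to(response, "Scott")
top_responder = scott.index[0]
print(scott)
```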


The Dashboard screenshot shows the breakup of responses Scott got