When: September 2017 to present
I began working on a personal research project I call Incontxt, which focuses on understanding Pakistan’s online interactions. With mobile and data coverage now available across the nation, our ever-growing population of Internet users is nearing 35 million. In a country of over 200 million people and over 140 million registered mobile users, I have always been intrigued by how this technology is being utilised, especially in the context of social media and education. Of the roughly 35 million people connected to the Internet, almost 90% are on social media. In contrast, our literacy rate is 57%, meaning there is a huge chunk of people who are not traditionally educated but are using the Internet, whether through WhatsApp, Twitter, Facebook or other apps.
To me this poses an opportunity, the same one that I see working for so many online courses around the world. Can we tweak this model to fit developing nations that are slowly coming online? Can we also use it to provide non-traditional education (understanding your rights as a citizen, the importance of voting in a democracy, gender equality, domestic violence, understanding corruption) on social problems that need alternative educational initiatives to help mitigate them?
Data is essential to support such a hypothesis, which is why I have been tracking social media trends and outbursts for the past few months. By focusing on Twitter reactions to certain news in the country, I’ve been able to gather data on the composition of our digital population’s reactions. Currently I’m running word frequency analysis on some topics, and I am working to add some basic natural language processing and sentiment analysis to further reveal what people generally have to say.
The main purpose is to identify the huge divide between reactions, to understand how important the truth is to the general population, and to showcase how the problems in people’s behaviour on social media reflect the underlying social problems in Pakistan. I want to use the massive amount of data online to highlight the relation between our offline social problems (for example, the treatment of women in society) and how people interact and react online (for example, a Pakistani netizen hurling abuse at a woman online for wearing what she wants).
Phase 1: Collecting data
– Selecting trends/topics to collect tweets about
Phase 2: Cleaning the data
– Converting JSON tweet data into CSV
– Removing common words
– Removing keywords
Phase 3: Analyse the data
a) Run word-frequency analysis
b) Run sentiment analysis on the tweets
Phase 4: Visualise the data
a) Bubble charts to show the frequency of words used in the tweets
Phase 5: Evaluate
– Potentially find compelling data proving the problems that exist
– Design ways to counter those issues, like gender inequality
– Work with on-the-ground organisations to develop education programs that can be disseminated online and through cellular networks.
Methodology & Process
The first step was to collect data. Every social media platform is rife with controversies, news, issues and viral trends. For this project, I selected some notable incidents that had garnered a huge response online.
The aim was to collect tweets relating to these incidents and then run word frequency and sentiment analysis on them to better understand the context and views of online citizens. I was hoping to find recurrences of common misconceptions and biases.
I first wanted to build a program that collected tweets, but I found a good alternative in twitterscraper, which collects tweets from past events. This was what I needed, because the new Twitter API only lets you get tweets from within the past 7 days. I selected certain social media “events” and, using keywords to search for the relevant tweets, collected tweets from the day the news broke to one week after.
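As a rough illustration of the collection window described above, the start and end dates for a search could be derived as in the sketch below. The function name and the example date are hypothetical placeholders, not the dates of any incident studied here, and the scraping call itself is left out.

```python
from datetime import date, timedelta


def collection_window(break_date, days_after=7):
    """Return (start, end): the day the news broke and the day one week later."""
    return break_date, break_date + timedelta(days=days_after)


# Hypothetical placeholder date, used only to show the call.
start, end = collection_window(date(2017, 9, 1))
print(start.isoformat(), end.isoformat())
```

The two dates can then be passed to whatever date-range options the scraping tool provides.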
The tweets were collected in JSON format, so I built a small Python script to convert them to CSV so that I could process the data more effectively. This way each tweet could be broken into the columns “username”, “tweet text”, “retweets” and “likes” to better understand it.
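A conversion along these lines might look like the minimal sketch below. It is not the original script: it assumes the scraped file is a single JSON array of tweet objects, and the key names in FIELDS are assumptions that would need to match whatever keys the scraper actually writes.

```python
import csv
import json

# Assumed key names; adjust them to match the keys in the scraped JSON.
FIELDS = ["username", "text", "retweets", "likes"]


def tweets_json_to_csv(json_path, csv_path, fields=FIELDS):
    """Write one CSV row per tweet, keeping only the chosen fields."""
    with open(json_path, encoding="utf-8") as f:
        tweets = json.load(f)  # expects a JSON array of tweet objects

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for tweet in tweets:
            # Missing keys become empty cells instead of raising an error.
            writer.writerow({key: tweet.get(key, "") for key in fields})

    return len(tweets)
```

Using csv.DictWriter rather than joining strings by hand means that commas and quotes inside the tweet text are escaped correctly.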
After this I decided to first run word frequency analysis on the tweet texts. I built a small Python script that broke each tweet down into words and then looped over them to count how many times each word was used. I removed unknown characters to avoid discrepancies. I also removed common words, such as conjunctions and other frequently repeated words that add little to the overall story (“of”, “on”, “from”, “to”, “the”, “this”, “that”, “where”).
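The counting step could be sketched roughly as follows. This is an illustration, not the original script: it uses collections.Counter in place of a hand-written counting loop, and the regular expression is an assumed stand-in for "removing unknown characters" that keeps only unaccented Latin letters and apostrophes, so words in Urdu script would be dropped. The stop-word set holds only the examples listed above, and extra_stopwords is a hypothetical parameter for incident-specific keywords.

```python
import re
from collections import Counter

# Only the common words named in the text; a real list would be longer.
STOPWORDS = {"of", "on", "from", "to", "the", "this", "that", "where"}


def word_frequencies(tweets, extra_stopwords=()):
    """Count how often each word appears across a list of tweet texts."""
    skip = STOPWORDS | {word.lower() for word in extra_stopwords}
    counts = Counter()
    for text in tweets:
        # Lowercase, then keep runs of letters/apostrophes; this discards
        # punctuation, digits, emoji and other unknown characters.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in skip)
    return counts
```

Calling counts.most_common(20) on the result gives the twenty most frequent remaining words, which is a quick way to spot further common words worth removing.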
Once this simple word count dataset was ready, I built a bubble chart to visualise the word frequencies. The bubble chart was built in D3.js using numerous open-source examples as a guide.
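To hand the counts over to the chart, the word frequencies can be exported as JSON. The sketch below assumes the shape that many D3 bubble chart examples feed into d3.hierarchy and d3.pack, a single object with a "children" array of name/value pairs. The key names, the function name and the top_n cut-off are assumptions, so they should be matched to whichever example the chart is based on.

```python
import json


def to_bubble_json(counts, top_n=50):
    """Serialise the top_n most frequent words as {"children": [...]} JSON."""
    # Sort by descending count, breaking ties alphabetically for stable output.
    top = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top_n]
    data = {"children": [{"name": word, "value": count} for word, count in top]}
    return json.dumps(data)
```

Capping the number of words keeps the chart legible, since the long tail of words used only once or twice would otherwise crowd out the larger bubbles.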
The White Dress
This bubble chart showcases the words used in tweets surrounding the widely discussed dress incident that Pakistani actress Mahira Khan was caught up in. Common words and keywords relating to the incident itself (variations of “Mahira Khan”, “Ranbir Kapoor” and related hashtags) were removed to prevent distortion of the scale.
The Friend Request
The bubble chart below documents word frequencies of tweets from the aftermath of Oscar-winning documentary-maker Sharmeen Obaid-Chinoy’s viral tweets accusing a doctor of harassing her sister by sending her a Facebook friend request after giving her an emergency check-up. Common and redundant words were removed from the original dataset to avoid distortions.