DATA METHODOLOGY

 
This section walks through the methodologies we used and shares what we learned, so that others can apply them in their own work.

GOOGLE TRENDS

What is Google Trends?

“Google Trends is a website by Google that analyzes the popularity of top search queries in Google Search across various regions and languages. Google Trends provides access to a largely unfiltered sample of actual search requests made to Google.” - Wikipedia


Google Trends provides four analytical outputs: interest over time, interest by subregion, related topics, and related queries. Users can select different time periods, regions, and search channels (Image Search, News Search, Google Shopping, and YouTube Search).

 

  • Interest over time: The search index is a normalized number reflecting the popularity of search requests. The index is scaled from 0 to 100, with 100 being the highest frequency within a time range. Specifically, a value of 100 is the peak popularity for the term, while a value of 50 means that the term is half as popular as at its peak. Although the search index is a relative number, it lets us compare different time periods.

  • Interest by subregion: This section tracks activity across regions. It also uses a normalized search index, where a value of 100 marks the region with the highest popularity within a fixed time range. However, the index should be read with a region's population in mind, because it measures the share of searches rather than their number: “A higher value means a higher proportion of all queries, not a higher absolute query count. So a tiny country where 80% of the queries are for ‘bananas’ will get twice the score of a giant country where only 40% of the queries are for ‘bananas’.”

  • Related topics and related queries: Each has “Rising” and “Top” selection buttons, which show the topics/queries with the largest increase in search frequency and the highest popularity, respectively.
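
To make the 0 to 100 scaling concrete, here is a minimal sketch of how such an index could be computed. It is not Google's actual implementation: the function name `scale_to_index`, the weekly counts, and the region percentages are all hypothetical, chosen only to reproduce the "half as popular" and "bananas" examples above.

```python
def scale_to_index(values):
    """Scale raw values so the largest becomes 100 (rounded to whole numbers)."""
    peak = max(values)
    if peak == 0:
        return [0 for _ in values]
    return [round(100 * v / peak) for v in values]


# Hypothetical weekly search counts for one term (illustrative numbers only).
weekly_counts = [120, 300, 600, 150]
interest_over_time = scale_to_index(weekly_counts)

# Hypothetical regions: percentage of each region's queries that are for the term.
# The input is a share of queries, not an absolute count, so region size plays no role.
region_share_percent = {"tiny country": 80, "giant country": 40}
names = list(region_share_percent)
interest_by_region = dict(
    zip(names, scale_to_index([region_share_percent[n] for n in names]))
)

print(interest_over_time)   # the peak week maps to 100
print(interest_by_region)   # the region with the higher share maps to 100
```

In this sketch the week with 300 searches gets an index of 50 because it has half the searches of the peak week, and the giant country gets half the score of the tiny country even though it may have far more searches in absolute terms.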

Why Google Trends?

  1. Google Trends is built on a large volume of actual Google search requests. It aggregates what people really search for into the search index. Its volume and authenticity make it one of the best tools for social media analysis.

  2. Google Trends provides search popularity data going back to 2004, which makes it especially suitable for trend analysis and for comparing different time periods.

  3. It is user-friendly. Users with no technical background can use it as an analytical tool and get rich insights.

Limitations?

  1. While the sample of Google search queries collected in Google Trends is generally sufficient, we may find that the queries or search terms we want are not available in Google Trends.

  2. The search index is normalized, so the definition of the benchmark “100” is crucial. However, the benchmark can change depending on the time range being viewed. For example, if we looked up the term “BLM” on May 25, the search index for May 25 would be the peak value of 100. If we looked up “BLM” again on June 25, the index for June 25 would be 100, and the index for May 25 would be close to 0. This makes comparisons confusing and inconsistent. Solutions might be found here

  3. Interest by region measures the share of each region's searches, not an absolute count, so it can be misleading when regions differ widely in population.
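
The shifting benchmark in limitation 2 can be illustrated with a small sketch. The daily counts below are invented, and `scale_to_index` is a hypothetical helper that mimics peak-based normalization; it is not how Google computes the index internally.

```python
def scale_to_index(values):
    """Scale raw values so the largest becomes 100 (rounded to whole numbers)."""
    peak = max(values)
    if peak == 0:
        return [0] * len(values)
    return [round(100 * v / peak) for v in values]


# Hypothetical daily search counts; the third day is a later, much larger spike.
daily_counts = [50, 200, 4000]

# Viewing only the first two days: day 2 is the peak and scores 100.
first_window = scale_to_index(daily_counts[:2])

# Viewing all three days: day 3 becomes the benchmark and day 2 drops sharply.
full_window = scale_to_index(daily_counts)

print(first_window)
print(full_window)
```

The same day goes from 100 to a single-digit value purely because the time range changed. Index values are therefore only comparable when they come from the same time range.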

TWITTER ANALYSIS 

What is Twitter Analysis?

Twitter analysis, by our definition, covers every step from data collection through data cleaning to data analysis (Natural Language Processing, or NLP). Conducting analysis on Twitter requires proficiency in Python.

  • Data Collection: Several Python packages give us access to Twitter, including GetOldTweets3 and Tweepy. Tweepy is a Python library for the official Twitter API, which provides rich information about user profiles and tweets. To use it, we first needed to apply for a developer account (many applications get rejected). Another limitation is that tweets older than 14 days cannot be extracted with a free account. That’s where GetOldTweets3 comes in: it can retrieve tweets from any time. However, there is no free lunch. GetOldTweets3 is limited in how much each request can return: so far, the maximum number of tweets we have been able to extract per request is about 10k. A repeated request does not continue from where the last one stopped; it starts over and returns the same results.

  • Data Analysis (NLP): 

“Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.” - Wikipedia

 

NLP covers rich domains ranging from sentiment analysis to topic analysis to language modeling (e.g. chatbots). For this project, we only touched on simple applications of NLP: sentiment analysis, word cloud visualizations, and topic analysis.

1. Sentiment Analysis: We were able to detect whether a tweet is positive, negative, or neutral, which was useful in identifying public sentiment. Our sentiment analyzer was trained with Flair using a pre-trained BERT model, and it reached an accuracy of about 70% on the test set.

2. Word Cloud: The word cloud is a classic and perhaps the best-known application of NLP. It counts the frequency of each word and gives higher-frequency words more emphasis in the visual, showing which keywords are talked about most.

3. Topic Analysis: We analyzed hashtags to get a sense of which topics were trending. A limitation of the word cloud is that many unimportant words may show up, which can be misleading. Hashtag analysis instead uses the hashtags that users intentionally attached to their tweets, which we assumed to be more concise and more closely related to the topic of Black Lives Matter.
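
The counting behind the word cloud and the hashtag analysis can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's actual code: the function names, the tiny stop-word list, and the sample tweets are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "is", "in", "for", "we"}


def word_counts(tweets):
    """Count words across tweets, ignoring hashtags, mentions, URLs, and stop words."""
    counts = Counter()
    for text in tweets:
        # Remove hashtags, mentions, and URLs so they are not counted as words.
        cleaned = re.sub(r"(#\w+|@\w+|https?://\S+)", " ", text.lower())
        words = re.findall(r"[a-z']+", cleaned)
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


def hashtag_counts(tweets):
    """Count hashtags across tweets, case-insensitively."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in re.findall(r"#\w+", text))
    return counts


# Hypothetical sample tweets, for illustration only.
sample_tweets = [
    "We march for justice #BLM #BlackLivesMatter",
    "Justice and peace in the streets #blm",
    "Read this thread https://example.com/thread #BLM",
]
words = word_counts(sample_tweets)
tags = hashtag_counts(sample_tweets)

print(words.most_common(3))
print(tags.most_common(3))
```

A word cloud library would then size each word by its count. The hashtag counts show why hashtags can be a cleaner topic signal: filler words such as "this" survive the word count unless the stop-word list catches them, while hashtags are chosen deliberately by the user.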

Why Twitter?

  1. Twitter is a uniquely multivocal and dialogic space: many views and channels are gathered on one platform, and active dialogues take place around these diverse views.

  2. Unlike most other social media platforms, Twitter provides access to its data.

Limitations?

  1. Due to limitations in getting data, we extracted only about 20k tweets containing the keywords “BLM” and “Black Lives Matter” each day from May 25 to June 25. These are only a sample of the whole corpus.

  2. Sentiment analysis has an accuracy of only about 70%. Just like human beings, the analyzer sometimes fails to detect sentiment and is not fully accurate. However, it still gives a sense of how the public is feeling.

  3. Tweets can be “dirty” because of ads or Twitter bots. This may not have been an issue in our analysis, since real users were tweeting at an extremely large scale, but it could be a problem for other projects.

  4. Twitter tends to attract a very specific audience and demographic, and an even smaller proportion of the population uses Twitter as a means, or as their main means, of activism.
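
Two of the data-quality concerns above lend themselves to a simple cleaning step: repeated requests that return the same tweets, and "dirty" tweets from ads or bots. The sketch below is one possible approach, not the project's actual pipeline. It assumes each tweet is a dictionary with hypothetical `id` and `text` keys, and it treats verbatim repetition as a crude, assumed signal for ads or bots.

```python
from collections import Counter


def drop_duplicate_ids(tweets):
    """Keep the first tweet seen for each ID, preserving the original order."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            unique.append(tweet)
    return unique


def repeated_texts(tweets, min_repeats=3):
    """Return normalized texts posted at least `min_repeats` times."""
    counts = Counter(t["text"].strip().lower() for t in tweets)
    return {text for text, n in counts.items() if n >= min_repeats}


# Hypothetical batches from two requests; the second repeats a tweet from the first.
batch_one = [
    {"id": 1, "text": "Justice now"},
    {"id": 2, "text": "Buy followers cheap"},
]
batch_two = [
    {"id": 1, "text": "Justice now"},          # same tweet returned again
    {"id": 3, "text": "Buy followers cheap"},
    {"id": 4, "text": "buy followers cheap "},
]

unique = drop_duplicate_ids(batch_one + batch_two)
suspicious = repeated_texts(unique)

print([t["id"] for t in unique])
print(suspicious)
```

The repetition threshold is arbitrary and would need tuning. Widely shared slogans can also repeat verbatim, so flagged texts are better reviewed by hand than removed automatically.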

 


Chelsie Lui, Jin Pu, Mikayla O'Reggio

Data Strategy Fellowship Program, ParsonTKO & TechSoup 2020

All Rights Reserved.
