Can we use social media such as Twitter to extract interesting signals about consumer behavior? If we can predict consumer behavior by monitoring social media, we can use those signals to provide recommendations for sellers on eBay. In particular, we can make recommendations to Terapeak sellers about what to sell to make the maximum profit from their investments. But first we need to validate our assumption, stated as: “one can use social media to predict people's consumption behavior and make recommendations to sellers about what to sell on eBay, in order to maximize their profits.”
Hypothesis: There is a correlation between what people post and discuss in social media and their consumption behavior in e-commerce.
Since Fall 2012, we have experienced a flu outbreak in the US and Canada. There are several reports about how this flu outbreak threatened seniors’ lives as well as children’s health in the US. There are several reports where physicians suggested that people use face masks in order to reduce the chance of catching the flu. As a result, the flu outbreak influenced the market as it caused Kimberly-Clark Corp. to increase its production of face masks and other personal-protective-equipment products. This connection motivated us to see if we could find any signals from social media that could be exploited for giving recommendations to sellers.
For our analysis we focus on the health area. In particular, we want to test whether there is any correlation between the way that people tweet about the flu on Twitter and consumer behavior in the health domain on eBay.
Methodology: Employ Machine Learning and Natural Language Processing to Analyze Tweets
In this report, we assume that people who are sick are likely to post about their situation on social media websites like Twitter. Our Twitter results actually validate this assumption. To test our previously stated hypothesis, we conducted a small study using the Terapeak tool as well as machine learning algorithms. We collected publicly available tweets (around 4 million) from Twitter for a period of two weeks, Jan 13, 2013 – Jan 27, 2013. To collect the tweets, we filtered the Twitter stream with a set of flu-related terms so that we only collected tweets that mentioned at least one of those terms. We extracted important information from each tweet, including the tweet's text, user ID, posted time, and location. One of our collected tweets is shown below as a sample:
Tweet #1: 2013-01-13 23:13:18;@;-73.07225285;@;44.80926225;@;sincerelyshelli;@;Twitter for iPhone;@;I have the flu
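As a rough illustration of the collection step, the sketch below parses a record in the format of Tweet #1 and checks it against a list of flu-related terms. The field names, the assumption that the first coordinate is the longitude, and the helper names parse_record and mentions_flu are ours for illustration only; the term list is just the handful of terms named in this post, not the full list used for filtering.

```python
from datetime import datetime
from typing import NamedTuple

DELIMITER = ";@;"  # field separator visible in the sample record

# Partial term list: only the terms named in this post (the full list is not given).
FLU_TERMS = ("flu", "vaccine", "sick", "coughing")


class TweetRecord(NamedTuple):
    posted_at: datetime
    longitude: float  # assumption: the first coordinate is the longitude
    latitude: float   # assumption: the second coordinate is the latitude
    user: str
    source: str       # client application, e.g. "Twitter for iPhone"
    text: str


def parse_record(line: str) -> TweetRecord:
    """Split one stored record into typed fields."""
    # Split at most five times so the tweet text may itself contain the delimiter.
    posted, lon, lat, user, source, text = line.split(DELIMITER, 5)
    return TweetRecord(
        posted_at=datetime.strptime(posted, "%Y-%m-%d %H:%M:%S"),
        longitude=float(lon),
        latitude=float(lat),
        user=user,
        source=source,
        text=text,
    )


def mentions_flu(text: str) -> bool:
    """Crude case-insensitive substring match against the term list."""
    lowered = text.lower()
    return any(term in lowered for term in FLU_TERMS)


record = parse_record(
    "2013-01-13 23:13:18;@;-73.07225285;@;44.80926225;@;"
    "sincerelyshelli;@;Twitter for iPhone;@;I have the flu"
)
print(record.posted_at, record.user, mentions_flu(record.text))
```

Substring matching is deliberately simple here; it also matches words such as "influenza", which may or may not be what a production filter would want.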
As we can see, the person who posted the above tweet clearly indicated that they have contracted the flu. Although we only collected flu-related tweets, we cannot assume that all tweets are sent only by people who have contracted it. The following tweets illustrate our point:
Tweet #2: "definitely coming down with the flu everyone's had :("
Tweet #3: "The great Boston influenza scare of 2012... #really? http://t.co/tjWfQ4SA;@;1;@;0.828821011434"
Although the second tweet (Tweet #2) was posted by somebody who probably has the flu, the third tweet (Tweet #3) was not posted by a sick person, as it just discusses news about the flu. Thus, we needed to use machine learning and natural language processing techniques to make sense of the tweets and compute the probability that a tweet was posted by a person who is really sick. After collecting the tweets, we developed a program to analyze all of them and compute the probability that each was posted by somebody who has the flu. Next, we imported our 4 million tweets into Lucene for post-processing.
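This post does not say which model produced the probabilities, so the following is only a minimal sketch of one common choice: a multinomial naive Bayes text classifier with add-one smoothing. The class name, the tokenizer, and the tiny training set are all hypothetical; three of the training texts are the sample tweets above, and the other three are made-up placeholders.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSickClassifier:
    """Toy multinomial naive Bayes; NOT the model used in this study."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def train(self, examples):
        for text, is_sick in examples:
            tokens = tokenize(text)
            self.word_counts[is_sick].update(tokens)
            self.doc_counts[is_sick] += 1
            self.vocab.update(tokens)

    def prob_sick(self, text):
        """Return P(sick | text) under the naive Bayes assumptions."""
        tokens = [t for t in tokenize(text) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            log_scores[label] = score
        # Normalize in a numerically stable way.
        top = max(log_scores.values())
        sick = math.exp(log_scores[True] - top)
        not_sick = math.exp(log_scores[False] - top)
        return sick / (sick + not_sick)


# Hand-labeled toy data: the first three texts are the sample tweets from this
# post, the last three are hypothetical placeholders.
TRAINING = [
    ("I have the flu", True),
    ("definitely coming down with the flu everyone's had :(", True),
    ("The great Boston influenza scare of 2012... #really?", False),
    ("ugh I feel so sick, fever and coughing all night", True),
    ("flu shot clinic opens downtown today", False),
    ("news report: flu outbreak spreads across the state", False),
]

classifier = NaiveBayesSickClassifier()
classifier.train(TRAINING)
print(classifier.prob_sick("I think I have the flu"))
print(classifier.prob_sick("news report about the influenza scare"))
```

With so few examples the probabilities are only indicative; a real classifier would need a much larger labeled set and a held-out evaluation.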
For the first analysis, we focused on the flu rate for the two-week period in January 2013 (from Jan 13 to Jan 27). We executed our query to compute the number of posted tweets per day. We only took into account those tweets where we were at least 60% certain that they were posted by a sick person. As mentioned earlier, the confidence level for a tweet was computed using machine learning and natural language processing algorithms. The frequency distribution of tweets over time is shown in the following graph:
As we see in the graph, the rate of flu-related tweets is very high in the middle of January (around 300K flu tweets per day) compared to the end of January. This suggests that the flu rate started decreasing toward the end of January. Our Twitter analysis is consistent with the results from the Centers for Disease Control and Prevention (CDC). As the tweet rate figure also shows, we were unable to collect tweets for Jan 19 and Jan 22-24 due to technical issues.
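The per-day count described above can be sketched as follows, assuming each tweet has already been reduced to a (posted time, probability) pair. The function names and the sample rows are placeholders for illustration, not our data; only the 60% threshold comes from the analysis itself.

```python
from collections import Counter
from datetime import date, datetime, timedelta

SICK_THRESHOLD = 0.60  # keep tweets we are at least 60% certain about


def daily_sick_counts(scored_tweets, threshold=SICK_THRESHOLD):
    """Count tweets per calendar day whose probability meets the threshold."""
    counts = Counter()
    for posted_at, prob in scored_tweets:
        if prob >= threshold:
            counts[posted_at.date()] += 1
    return counts


def days_without_counts(counts, start, end):
    """List the days in [start, end] that have no counted tweets at all."""
    span = (end - start).days + 1
    all_days = (start + timedelta(days=offset) for offset in range(span))
    return [day for day in all_days if day not in counts]


# Placeholder rows for illustration only.
scored = [
    (datetime(2013, 1, 13, 23, 13, 18), 0.91),
    (datetime(2013, 1, 13, 8, 0, 0), 0.42),   # below threshold, ignored
    (datetime(2013, 1, 14, 9, 30, 0), 0.60),  # exactly at threshold, kept
    (datetime(2013, 1, 15, 12, 0, 0), 0.75),
]

counts = daily_sick_counts(scored)
gaps = days_without_counts(counts, date(2013, 1, 13), date(2013, 1, 16))
print(sorted(counts.items()))
print(gaps)
```

Listing the empty days explicitly helps keep collection gaps, such as the ones on Jan 19 and Jan 22-24, from being read as real drops in the flu rate.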
We then used the Terapeak tool to analyze the market behavior for flu-related products. We used the same keywords that we used for filtering the tweets (flu-related terms such as flu, vaccine, sick, coughing, and so on). Our intuition was that there should be a relation between the rate of flu-related tweets and the selling rate for health-related products on eBay. Real-world events have an effect on e-commerce marketplaces, as indicated by our recent blog post on the meteor strike in Russia. The two figures below were collected from our Terapeak tool.
As we see in the graphs above, sales in the marketplace relating to flu medicines peaked in the December-January time frame. This observation matches the rate of flu-related tweets as well as the reported flu epidemic in North America during the same time frame. Interestingly, we also noticed another peak during the August-October time period in our analysis from the Terapeak tool. We were curious about this peak and did further research on the CDC website to find an explanation for it. We extracted the graph of positive flu cases from the CDC website for the same time period, as shown below. The CDC graph clearly indicates a peak in positive flu cases during the same period (August-October 2012).
Our results support the correlation between flu outbreaks and consumer behavior in e-commerce for health-related products. In other words, we could clearly see that the flu outbreak positively influenced the sales rate of flu-related products. We were able to test and validate our hypothesis by taking advantage of the Terapeak tool. In particular, the Terapeak tool provided market insights over a wide range of parameters and helped us validate a strategy based on social media. In summary, our analysis shows that harnessing social media to predict real-world events (such as a flu outbreak) early enough could give a seller an edge over the competition. The real value lies in tracking trends in social media and using those signals to predict consumers’ behavior in e-commerce marketplaces, and Terapeak provides an indispensable tool for this process.
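The comparison above was made by reading the tweet, sales, and CDC curves side by side. One way to put a number on such a relationship would be a Pearson correlation coefficient between two aligned daily series, sketched below. The function name is ours, and the two series are placeholder values for illustration, not measurements from Twitter or Terapeak.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Placeholder series for illustration only: daily flu-tweet counts and daily
# unit sales for the same days, in the same order.
tweets_per_day = [10, 20, 30, 40]
sales_per_day = [1, 2, 3, 4]

r = pearson(tweets_per_day, sales_per_day)
print(round(r, 3))
```

A coefficient near 1 would indicate that the two series rise and fall together; it would not by itself show that tweets lead sales in time, which would need a lagged comparison.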
USA Today: Flu Attacking Elderly at Historically High Rates
Daily Mail: Flu Epidemic Continues to Sweep the U.S.
WebMD: Swine Flu (H1N1) and Face Masks
Google Flu Trends
Centers for Disease Control and Prevention
Wall Street Journal: Kimberly-Clark Increases Face-Mask Production