News Stock Market Prediction with Reddit

Here is my final report from my last research project. We used headlines from the r/worldnews subreddit to predict the DJIA index trend, combining technologies such as Azure HDInsight with Hive for Big Data processing in the cloud and KNIME for the advanced and text analytics pipeline. The conclusions were not at all what we expected, and they show how valuable early Exploratory Data Analysis is, along with making sure you have access to the right data.


Investment Fund Analytics: Using Daily World news for Stock Market Prediction

This is a summary presentation of the final group project I worked on this winter for the Data Mining course in the Masters of Data Science and Analytics program at Ryerson University.


In this project we use daily world news (more specifically, the /r/worldnews subreddit) to try to predict the trend (up or down) of the Dow Jones Industrial Average daily prices. The idea for this project is not originally mine: it was first posted as a Kaggle dataset with many kernel submissions. Our project changed a few things:

  • Reprocessed the data from the source: Extracted the /r/worldnews posts directly from the complete Reddit dataset and derived the up/down labels from DJIA data published on wsj.com
  • Changed the analytics tool: Used KNIME instead of R, Python or the like
  • Spent some more time on EDA: And it wasn’t even enough. If we had had more time, we might have reached the same conclusion much earlier
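The up/down target mentioned in the first bullet can be derived from daily closing prices. The following is only a minimal Python sketch of that idea, not the procedure we ran in the project: the function name `label_trend`, the rule that an unchanged close counts as "up", and the sample values are all assumptions made up for illustration.

```python
def label_trend(closes):
    """Label each trading day 1 (up or unchanged) or 0 (down).

    `closes` is a list of (date_string, closing_price) tuples sorted by
    date ascending. The first day has no previous close, so it gets no label.
    Treating an unchanged close as "up" is an assumption of this sketch.
    """
    labels = {}
    # Pair every day with the day before it.
    for (_, prev_close), (day, close) in zip(closes, closes[1:]):
        labels[day] = 1 if close >= prev_close else 0
    return labels


# Made-up prices, for illustration only (not real DJIA values).
sample_closes = [
    ("2016-01-04", 100.0),
    ("2016-01-05", 101.5),
    ("2016-01-06", 99.0),
]
trend = label_trend(sample_closes)
```

With the made-up prices above, the second day is labelled up and the third day down, while the first day is left out because there is nothing to compare it with.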

Using the complete Reddit dataset available (posts, comments, everything!) to reprocess the data (and arrive at the same data as the Kaggle dataset) was a very interesting exercise: I used Azure HDInsight to rapidly create a cluster and Hive to process and filter the JSON files to extract just the subreddit content. The DJIA data is much smaller (and simpler to manage). The two were then joined to obtain a dataset similar to the one from Kaggle.
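To make the filter-and-join step concrete, here is a small Python sketch of the same idea on a handful of records. It is a stand-in for illustration, not the Hive queries we ran on the cluster. The JSON field names (`subreddit`, `created_utc`, `title`), the helper names and the sample records are assumptions of this sketch.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def headlines_by_day(json_lines, subreddit="worldnews"):
    """Group post titles from one subreddit by UTC calendar day.

    Each element of `json_lines` is one JSON object per line. The field
    names used here are assumed, not taken from the project code.
    """
    by_day = defaultdict(list)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)
        # Keep only posts from the requested subreddit.
        if post.get("subreddit", "").lower() != subreddit:
            continue
        created = int(float(post["created_utc"]))
        day = datetime.fromtimestamp(created, tz=timezone.utc).date().isoformat()
        by_day[day].append(post["title"])
    return dict(by_day)


def join_with_labels(by_day, labels):
    """Inner join on date: keep only days that have both headlines and a label."""
    return [(day, labels[day], by_day[day]) for day in sorted(by_day) if day in labels]


# Made-up records and labels, for illustration only.
sample_lines = [
    '{"subreddit": "worldnews", "created_utc": 1451952000, "title": "Headline A"}',
    '{"subreddit": "funny", "created_utc": 1451952000, "title": "Not news"}',
    '{"subreddit": "worldnews", "created_utc": 1452038400, "title": "Headline B"}',
]
sample_labels = {"2016-01-05": 1}
rows = join_with_labels(headlines_by_day(sample_lines), sample_labels)
```

The join is an inner join on the date, so days with headlines but no trading data (weekends, holidays) drop out of the final dataset.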

In a future post, I will publish the project report paper we wrote, with our detailed procedure and reports.

#FluxFlow: A visual tool to analyze “anomalous information” in social media

Thanks to the Social Media Analytics course I’m taking as part of my Masters in Data Science program, I found a very interesting paper about #FluxFlow that I had to summarize and present.

#FluxFlow is an analytics data visualization tool that helps identify and understand how ‘anomalous’ information spreads in social media. In the context of social media, “anomalous information” can in most cases be equated with rumors and ‘fake news’. Having a tool like this available to understand how these patterns work can help in identifying potentially harmful consequences and taking action on them.

The original paper used for this research (written by Jian Zhao, Nan Cao, Zhen Wen, Yale Song, Yu-Ru Lin, Christopher Collins) is available here for you to read, along with a very concise and descriptive video here, and the real #FluxFlow tool is here for you to see and understand. I created a simple, brief presentation to summarize the tool and its potential applications to other scenarios.