This is a summary presentation about the final group project I worked on during this winter for the Data Mining course in the Masters of Data Science and Analytics program at Ryerson University.
In this project we use daily world news (and more specifically the /r/worldnews subreddit) to try to predict trends (up or down) on the Dow Jones Industrial Average daily prices. The idea for this project is not originally mine, and it was first posted as part of a Kaggle dataset, with many kernel submissions , and our project changed a couple of things:
- Reprocess the data from the source: Extract the /r/worldnews directly from the complete reddit dataset, get up/down from DJIA data coming from wsj.com
- Change analytics tool: Use KNIME instead of R, Python or the likes
- Spent some more time with EDA: And it wasn’t even enough, if we would have had more time we may have with the same conclusion way earlier
Using the complete Reddit dataset available (posts, comments, everything!) to reprocessing the data (and get to the same data as the Kaggle dataset) was a very interesting exercise: I used Azure HDInsight to rapidly create a cluster and Hive to process and filter the JSON files to extract just the subreddit content. The DJIA data is much smaller (and simple to manage) and then both of them were joined to obtain a dataset similar to the one from Kaggle.
In a future post, I will publish the project report paper we published with our detailed procedure and reports.