Some Key Lessons Learned in 2020

Last year (end of 2019) I started the practice of an end-of-year personal review and, noticing how good it is for self-reflection, decided to make it a staple of every New Year cycle, partly as an improvement exercise and partly as a rite of passage.

This is a summarized and aggregated reflection drawn from some of the notes jotted down throughout 2020 during my journaling practice, where I capture my monthly and quarterly experiences and learnings. Reflection and self-understanding are key parts of everyone's personal growth, and sharing is part of that cycle too, as a way to gain perspective.

Without further ado, these are the most important things I learned in 2020.

Reading books is the most effective way to change your perspective about things

The Kindle Oasis that I got on October 17, 2020 has changed my reading habits forever. I was in the Kobo ecosystem for 10 years and decided to try Kindle this year, and found a much-needed, renewed passion for books and text consumption. Like everything, the newness factor will wear off in time, but for now I am enjoying reading at least one or two books a week, and this has put so many new ideas and concepts in my head. Not every book is worth reading, and not everything in a book is worth reading, but the only way to find out is … by reading them! I had been looking for this kind of insight for years, and my latest attempt was binging YouTube videos, only to find out that coming back to books was the answer I had known from the beginning.

Journaling is Key for Self Reflection

Once I started to see the benefits that come with the disciplined practice of journaling, there was no way to go back. This journey of exploration into different methods started in July 2019, first with the Best Self Journal. Later in 2019 I got to experiment more with the Bullet Journal Method in a regular notebook and also got to customize some of it. Throughout 2020 I journaled regularly on paper, and in the final quarter I also experimented a bit more with journaling digitally in OneNote with an iPad and Apple Pencil. Although I am still not sure which method is best, doing it in any shape or form has tremendous benefits: you get to see the power of having “photographs” of your thoughts and brain that you can go back to weeks, months and even years later to re-read and reflect on, and you will see your growth go from linear to exponential.

A Digital Productivity System is Key for High Performance Results

During some of the initial slowdown in the first weeks of the pandemic, I decided to look inward and focus on understanding how to improve my personal productivity and organizational methods. I found the P.A.R.A. method from Tiago Forte, which has been a godsend to tame all the digital information most of us need to process, consume or discard every day. That, paired with a refined approach to the Bullet Journal and a digital note-capturing and organizational system in OneNote, has saved me and allowed me to keep my focus and mental sanity. All of this has also been married with some of my own obsessions with self-quantification and self-metrics tracking, which gave me numbers I can go back to in order to understand where I spent most of my time in 2020 and with whom.

Consistency is Always the Key to Not Losing Your Focus

To be fair, 2020 was not the best year to stay focused on any single thing, with all the events that happened. But the main lesson learned here is that if you don’t have a system that allows you to register, remind and log your goals and habits (your memory doesn’t count as such a system), the day-to-day will always eat up all the time available in your life. Having a constant reminder of what your path is, what you want and have to do, and why you have to do it helps us stay consistent on the path we set for ourselves in the times when we were clear and not overwhelmed by the daily grind.

Anything and everything can change in an instant

I think we have all experienced this in 2020; it was such a disruptive change in our way of life that no one can ignore it. In the first weeks of March I was on planes to Montreal, NY and Detroit, and one week later… I found myself locked down at home, and I have not been able to go back to the previous rhythm of life ever again (not yet, at least). Nothing could have prepared us for what we have been through, or for the speed of change. We all somehow adapted, changed and survived the single largest simultaneous global event in the history of humanity.

Leadership to solve a problem requires having a different perspective

If you try to lead from the trenches, by rolling up your sleeves and working with your team like I did during parts of 2020, you lose a very much needed strategic perspective. It is also not fair to your team, as you are not playing the role they need. The closer you are to a problem or to the tasks on a project, the more difficult it is to lead it with the tools and mindset required to look at it from a different viewpoint.

Coaching is about the Questions you ask and not about the People or the Problem

Similar to the perspective required for leadership, the only way to help others is to help them find their own solutions, not to provide solutions for them. The best way to do this is coaching. Questions are the most important tool we have to help others find their own way in their professional or personal lives. By asking some key questions, we can help others gain the perspective they sometimes cannot find within themselves and find the answers they need for the challenges they are facing. And there is nothing more fulfilling than connecting with others and seeing them grow on the path that makes them fulfilled.

Named Entity Recognition from Online News

This is a project from the Natural Language Processing course in my Masters in Data Science program. The project aimed to create a series of models for the extraction of Named Entities (People, Locations, Organizations, Dates) from news headlines obtained online. We created two models: a traditional Natural Language Processing model using Maximum Entropy, and a Deep Neural Network model using pre-trained word embeddings. Accuracy results for both models show similar performance, but their requirements and limitations are different and can help determine which type of model is best suited for each specific use case.
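
As a rough illustration of the second approach (a hedged sketch, not our actual project code; the vocabulary size, embedding matrix, tag set and hyperparameters are placeholders), a sequence-tagging model built on pre-trained word embeddings can be put together in Keras like this:

# Minimal sketch of a deep NER tagger using pre-trained embeddings (illustrative only).
import numpy as np
import tensorflow as tf

vocab_size, emb_dim, n_tags = 20000, 100, 9            # e.g. BIO tags for PER/LOC/ORG/DATE plus O
embedding_matrix = np.random.rand(vocab_size, emb_dim)  # stand-in for GloVe/word2vec vectors

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
        trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_tags, activation="softmax")),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_tokens, y_tags, ...) where X_tokens are padded word-index sequences from the
# headlines and y_tags are the per-token entity labels.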

The final conclusion is that, as the Deep Learning model is less dependent on language-specific grammar rules, it is more generalizable (provided embeddings and some labeled corpora are available in the target language), whereas the Maximum Entropy model will perform poorly in any language for which there is no domain knowledge to create the required features.

This is our deck for the final presentation:


This is our final report / paper with our results and conclusion:


All source code for this project can be found in this GitHub repository: https://github.com/bnajlis/named_entity_recognition

News Stock Market Prediction with Reddit

Here is my final report from my latest research project. We used headlines from the r/worldnews subreddit to predict the DJIA index trend, combining a variety of technologies like Azure HDInsight with Hive for Big Data processing in the cloud and KNIME for the advanced and text analytics pipeline. The conclusions are not at all what we expected them to be, and they show the high importance and value of early Exploratory Data Analysis and of making sure you have access to the right data.
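
For context, the target variable in this project is simply whether the DJIA closed up or down each day. A minimal labeling sketch in pandas (assuming a CSV of daily prices with Date and Close columns; not our exact pipeline) would be:

# Build the up/down target from DJIA daily closing prices (illustrative column names).
import pandas as pd

djia = pd.read_csv("djia_daily.csv", parse_dates=["Date"]).sort_values("Date")
djia["Label"] = (djia["Close"].diff() >= 0).astype(int)  # 1 = closed equal or higher than the day before
djia = djia.iloc[1:]  # drop the first row, which has no previous day to compare against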


Investment Fund Analytics: Using Daily World news for Stock Market Prediction

This is a summary presentation about the final group project I worked on during this winter for the Data Mining course in the Masters of Data Science and Analytics program at Ryerson University.


In this project we use daily world news (more specifically, the /r/worldnews subreddit) to try to predict trends (up or down) in the Dow Jones Industrial Average daily prices. The idea for this project is not originally mine; it was first posted as part of a Kaggle dataset, with many kernel submissions, and our project changed a couple of things:

  • Reprocess the data from the source: extract /r/worldnews directly from the complete Reddit dataset, and get the up/down labels from DJIA data coming from wsj.com
  • Change the analytics tool: use KNIME instead of R, Python or the like
  • Spend some more time on EDA: and even that wasn’t enough; if we had had more time, we might have reached the same conclusion much earlier

Using the complete Reddit dataset available (posts, comments, everything!) to reprocess the data (and arrive at the same data as the Kaggle dataset) was a very interesting exercise: I used Azure HDInsight to rapidly create a cluster and Hive to process and filter the JSON files to extract just the subreddit content. The DJIA data is much smaller (and simpler to manage), and the two were then joined to obtain a dataset similar to the one from Kaggle.
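
The actual filtering was done with Hive on HDInsight, but to give a flavor of the transformation, here is an equivalent hedged sketch in PySpark (field names follow the standard Reddit dump schema; file paths and the DJIA label file are illustrative):

# Filter the Reddit submissions dump down to /r/worldnews and join with the DJIA labels.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("worldnews-filter").getOrCreate()

posts = spark.read.json("reddit_submissions/*.json")
worldnews = (posts
             .filter(F.col("subreddit") == "worldnews")
             .withColumn("date", F.to_date(F.from_unixtime("created_utc")))
             .select("date", "title"))

djia = (spark.read.csv("djia_updown.csv", header=True)  # columns: date, label (1 = up, 0 = down)
             .withColumn("date", F.to_date("date")))

dataset = worldnews.join(djia, on="date", how="inner")
dataset.write.mode("overwrite").parquet("news_djia_dataset")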

In a future post, I will publish the project report paper with our detailed procedure and results.

#FluxFlow: A visual tool to analyze “anomalous information” in social media

Thanks to the Social Media Analytics course I’m taking as part of my Masters in Data Science program, I found a very interesting paper about #FluxFlow that I had to summarize and present.

#FluxFlow is an analytics and data visualization tool that helps identify and understand how ‘anomalous’ information spreads in social media. In the context of social media, “anomalous information” can in most cases be equated to rumors and ‘fake news’. Having a tool like this available to understand how these patterns work can help identify potentially harmful consequences and take action on them.

The original paper (written by Jian Zhao, Nan Cao, Zhen Wen, Yale Song, Yu-Ru Lin and Christopher Collins) used for this research is available here for you to read, plus a very concise and descriptive video here, and the actual #FluxFlow tool is here for you to see and understand. I created a super simple and brief presentation to summarize the tool and its potential applications to other scenarios.


Social Media Analytics: Bell Let’s Talk 2017

Two weeks ago I started the second semester of the Masters in Data Science program, and as part of it I am taking a course in Social Media Analytics. The first lab assignment for this course was on January 25, and the objective was to analyze the Bell Let’s Talk social media campaign. Using a proposed tool called Netlytic (a community-supported text and social networks analyzer that automatically summarizes and discovers social networks from online conversations on social media sites), created by the course’s professor Dr. Anatoliy Gruzd, I downloaded a tiny slice of #BellLetsTalk hashtagged data and created this super simple Power BI dashboard.

I have been wanting to play with Power BI’s Publish to Web functionality for quite some time and thought this was a great chance to give it a cool use. The data was exported from Netlytic as three CSV files and then imported into Power BI Desktop. With the desktop tool I created a couple of simple measures (total number of tweets and posts, average number of tweets and posts per minute, and so on) and then some simple visualizations.
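
For the curious, the same summary measures are easy to reproduce outside Power BI. Here is a rough pandas sketch; the Netlytic export’s column names are an assumption on my part:

# Recompute the dashboard measures from the Netlytic CSV export (column names assumed).
import pandas as pd

posts = pd.read_csv("netlytic_export.csv", parse_dates=["pub_date"])
total_posts = len(posts)
posts_per_minute = posts.set_index("pub_date").resample("1min").size()
print(total_posts, posts_per_minute.mean())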


Introduction to KNIME

KNIME is one of the many open source data analytics and blending tools available for free online.


This is a very basic presentation about KNIME I did at one of the labs as part of a Data Mining course in the Masters in Data Science and Analytics program at Ryerson University. The tool is really great and I ended up using it as the main analytics tool to deliver the final project for the same course.

Implementing a TF-IDF (term frequency-inverse document frequency) index with Python in Spark

Introduction

As part of the final exam assignment for my Masters in Data Science course “DS8003 – Management of Big Data Tools”, I created a Big Data TF-IDF index builder and query tool. The tool consists of a script with functions to create a TF-IDF (term frequency-inverse document frequency) index, which is then used to return matching documents for a provided list of terms and expected number of results.

Features Summary

  • Developed with PySpark, SparkSQL and DataFrames API for maximum compatibility with Spark 2.0
  • Documents to build the TF-IDF index can be on a local or HDFS path
  • Index is stored in parquet format in HDFS
  • Query terms and number of results are specified via command-line arguments
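
To give a flavor of the core computation (a simplified sketch, not the full assignment script; paths and column names are illustrative):

# Simplified TF-IDF index build with PySpark DataFrames; the real tool adds querying and CLI args.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tfidf-sketch").getOrCreate()

# One row per (document, term) after a naive tokenization
docs = spark.read.text("docs/*.txt").withColumn("doc", F.input_file_name())
terms = (docs.select("doc", F.explode(F.split(F.lower("value"), r"\W+")).alias("term"))
             .filter(F.col("term") != ""))

n_docs = terms.select("doc").distinct().count()

tf = terms.groupBy("doc", "term").count().withColumnRenamed("count", "tf")
df = (terms.select("doc", "term").distinct()
           .groupBy("term").count().withColumnRenamed("count", "df"))

tfidf = (tf.join(df, "term")
           .withColumn("tfidf", F.col("tf") * F.log(F.lit(n_docs) / F.col("df"))))
tfidf.write.mode("overwrite").parquet("hdfs:///tmp/tfidf_index")

A query over an index like this filters by the provided terms and ranks documents by their aggregated tf-idf score, returning the requested number of results.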


Analyzing Reddit Public Comments on Azure Data Lake and Azure Data Analytics (Part 1.5)

In the previous article in this series, I skipped the part where I downloaded the data. At first I used my laptop and a downloader to get the files locally, and I ended up uploading them to the Azure Data Lake Store folders. Another alternative that I wanted to try, and will show you in this post, is downloading the data directly into an Azure VM and onto a file share.

You can mount file shares inside Linux VMs, with the only restriction that the VM has to be within the Azure infrastructure (apparently this is a limitation caused by the fact that mounting an SMB file share in Linux does not support encryption just yet). That is the reason why we need to spin up an Azure VM to do this; otherwise it would be possible to do it directly from your own laptop (you can use a Windows downloader if you mount the Azure File Share in Windows too). In this case I can download all the files and have the 160GB of data available, with the goal of moving only the required files to the Data Lake Store when needed to run analytics.

Creating the share to store the data

1. Get a connection string to your storage account. This is the simplest way I could find to create services associated with storage through the CLI.

azure storage account connectionstring show [STORAGE_ACCOUNT_NAME]

2. Copy the connection string returned and set it to the AZURE_STORAGE_CONNECTION_STRING environment variable. Don’t forget the double quotes!

export AZURE_STORAGE_CONNECTION_STRING="[CONNECTION_STRING]"

3. Create the file share. You will be able to mount this from the VM you will create right after. By default, this share will have a limit of 5TB, which is plenty for the 160GB we will download.

azure storage share create [SHARE_NAME]

Creating an Azure Linux VM using CLI

I’ve been good friends with Ubuntu for quite some time now, so I will create a minimal instance of an Ubuntu Server LTS. I only need to have the VM running while downloading and transferring files into the larger storage.

1. Register the network and compute providers

azure provider register Microsoft.Network
azure provider register Microsoft.Compute

2. Quick-create the VM. After several trial-and-error runs, and reading some hidden documentation, I found the command-line option to select the VM size (Basic_A0 is the smallest instance you can get). The command will prompt for the Resource Group Name, Virtual Machine Name, Location Name (it has to be the same as the resource group!), Operating System, Username and Password. It will go through several steps (creating a storage account, creating a NIC, creating an IP configuration and public IP) and finally it will create your VM (I really appreciate that I don’t have to go through all those steps myself!).

azure vm quick-create -z Basic_A0 -Q UbuntuLTS

This command will come back with some info (notably the Public IP address and FQDN) that you can use to connect to your VM right away.

3. Connect to your newly minted VM using SSH, and the credentials you entered in the previous step.

4. Install the tools needed to mount the file share, then mount it. I used “data” as my mount point, so I did a mkdir data in my home directory.


sudo apt-get install cifs-utils
sudo mount -t cifs //[ACCOUNT_NAME].file.core.windows.net/[SHARE_NAME] ./[MOUNT_POINT] -o vers=3.0,username=[ACCOUNT_NAME],password=[STORAGE_ACCOUNT_KEY_ENDING_IN_==],dir_mode=0777,file_mode=0777

If you want to check if this is working, you can copy a local file to the mount point and use the Azure Management portal to check if the file was uploaded correctly.

5. Install Transmission, get the tracker file and start downloading. The -w option indicates where to download the files; in this case all data goes to the file share (as the VM HDD is just too small).


sudo apt-get install transmission-daemon
sudo /etc/init.d/transmission-daemon start
sudo apt-get install transmission-cli
wget http://academictorrents.com/download/7690f71ea949b868080401c749e878f98de34d3d.torrent
transmission-cli -w ./data 7690f71ea949b868080401c749e878f98de34d3d.torrent

6. Wait patiently for a couple of hours (around 5-6 hours) until your download completes… The next step will be to set up an Azure Data Factory pipeline to move the data from the File Share to the Data Lake Store.