Analyzing Reddit Public Comments on Azure Data Lake and Azure Data Analytics (Part 1.5)

In the previous article in this series, I skipped the part where I downloaded the data. At first I used my laptop and a downloader to get the files locally, and then uploaded them to the Azure Data Lake Store folders. An alternative that I wanted to try, and will show you in this post, is downloading the data directly from an Azure VM into a file share.

You can mount file shares inside Linux VMs, with the only restriction that the VM has to be within the Azure infrastructure (apparently this limitation exists because mounting an SMB file share in Linux does not support encryption just yet). That’s why we need to spin up an Azure VM to do this; otherwise it would be possible to do it directly from your own laptop (you can do this with a Windows downloader if you mount the Azure File Share in Windows). This way I can download all the files and have the 160GB of data available, with the goal of moving only the required files to the Data Lake Store when I need them to run analytics.

Creating the share to store the data

1. Get a connection string for your storage account. This is the simplest way I could find to create storage-related services through the CLI.

azure storage account connectionstring show [STORAGE_ACCOUNT_NAME]

2. Copy the connection string returned and set it to the AZURE_STORAGE_CONNECTION_STRING environment variable. Don’t forget the double quotes!
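
On a bash-like shell, that could look like the sketch below. The value shown is only a placeholder that follows the usual connection string layout; paste the string returned in step 1 instead. The double quotes matter because the string contains semicolons, which the shell would otherwise treat as command separators.

```shell
# Placeholder value: replace it with the connection string returned in step 1.
# The quotes keep the shell from splitting the string at each semicolon.
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=[STORAGE_ACCOUNT_NAME];AccountKey=[STORAGE_ACCOUNT_KEY]"

# Print the variable to confirm it holds the whole string.
echo "$AZURE_STORAGE_CONNECTION_STRING"
```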


3. Create the file share. You will be able to mount this from the VM you will create right after. By default, this share will have a limit of 5TB, more than enough for the 160GB we will download.

azure storage share create [SHARE_NAME]

Creating an Azure Linux VM using CLI

I’ve been good friends with Ubuntu for quite some time now, so I will create a minimal instance of an Ubuntu Server LTS. I only need to have the VM running while downloading and transferring files into the larger storage.

1. Register the network and compute providers

azure provider register Microsoft.Network
azure provider register Microsoft.Compute

2. Quick create the VM. After several trial and error runs, and reading some hidden documentation, I found the command line option to select the VM size (Basic_A0 is the smallest instance you can get). The command will prompt for the Resource Group Name, Virtual Machine Name, Location Name (has to be the same as the resource group!), Operating System, Username and Password. It will go through several steps (creating a storage account, creating a NIC, creating an IP configuration and public IP) and finally it will create your VM (I really appreciate that I don’t have to go through all those steps myself!).

azure vm quick-create -z Basic_A0 -Q UbuntuLTS

This command will come back with some info (notably the Public IP address and FQDN) that you can use to connect to your VM right away.

3. Connect to your newly minted VM using SSH, and the credentials you entered in the previous step.

4. Install the tools needed to mount the file share, then mount it. I used “data” as my mount point, so I ran mkdir data in my home directory.

sudo apt-get install cifs-utils
sudo mount -t cifs //[ACCOUNT_NAME].file.core.windows.net/[SHARE_NAME] ./[MOUNT_POINT] -o vers=3.0,username=[ACCOUNT_NAME],password=[STORAGE_ACCOUNT_KEY_ENDING_IN_==],dir_mode=0777,file_mode=0777
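
Passing the storage key on the command line leaves it in your shell history. A possible alternative, sketched here and not what I ran for this post, is to write the credentials to a file only you can read and hand it to mount through the credentials= option of mount.cifs. The path ~/.azure-share.cred is a hypothetical choice; the bracketed values are the same placeholders as above.

```shell
# Hypothetical path for the credentials file; any private location works.
CRED_FILE="$HOME/.azure-share.cred"

# Create the file with owner-only permissions before the key is written to it.
( umask 077 && : > "$CRED_FILE" )
chmod 600 "$CRED_FILE"

# Placeholders: fill in your storage account name and key.
cat > "$CRED_FILE" <<'EOF'
username=[ACCOUNT_NAME]
password=[STORAGE_ACCOUNT_KEY_ENDING_IN_==]
EOF

# Once the placeholders are filled in, mount with the file instead of an inline password:
# sudo mount -t cifs //[ACCOUNT_NAME].file.core.windows.net/[SHARE_NAME] ./[MOUNT_POINT] -o vers=3.0,credentials=$HOME/.azure-share.cred,dir_mode=0777,file_mode=0777
```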

If you want to check if this is working, you can copy a local file to the mount point and use the Azure Management portal to check if the file was uploaded correctly.
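
Part of that check can also be scripted from inside the VM. This is only a sketch: check_share_write is a hypothetical helper name, and it takes the mount point as its first argument. It writes a small probe file, reads it back, compares the two copies and cleans up after itself.

```shell
# Hypothetical helper: write a probe file into a directory and read it back.
check_share_write() {
  target_dir="$1"
  probe_src="$(mktemp)"                  # local temporary file
  probe_dst="$target_dir/probe-$$.txt"   # copy that should land on the share

  date > "$probe_src"
  if cp "$probe_src" "$probe_dst" && cmp -s "$probe_src" "$probe_dst"; then
    echo "OK: $target_dir is writable"
    status=0
  else
    echo "FAILED: could not write to $target_dir"
    status=1
  fi

  rm -f "$probe_src" "$probe_dst"
  return "$status"
}

# Usage: only run the check if ./data really is a mount point,
# otherwise the probe would just be written to the VM disk.
if mountpoint -q ./data 2>/dev/null; then
  check_share_write ./data
fi
```

The mountpoint guard is there because the write test alone would also succeed on a plain local directory.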

5. Install transmission, get the torrent file and start downloading. The -w option indicates where to download the files; in this case all data goes to the file share (as the VM HDD is just too small).

sudo apt-get install transmission-daemon
sudo /etc/init.d/transmission-daemon start
sudo apt-get install transmission-cli
transmission-cli -w ./data 7690f71ea949b868080401c749e878f98de34d3d.torrent
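
A download of this size runs for hours, and closing the SSH session would normally take a foreground transmission-cli with it. One way around that, which I am sketching here as an option and did not use above, is to start the command detached with nohup and follow a log file instead. run_detached and transmission.log are hypothetical names.

```shell
# Hypothetical helper: run a command detached from the terminal, logging its output.
run_detached() {
  log_file="$1"
  shift
  nohup "$@" > "$log_file" 2>&1 < /dev/null &
  echo "$!"   # PID of the background job, handy for a later kill
}

# Usage (commented out so the sketch does nothing until you opt in):
# run_detached transmission.log transmission-cli -w ./data 7690f71ea949b868080401c749e878f98de34d3d.torrent
# tail -f transmission.log
```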

6. Wait patiently for around 5-6 hours until your download completes. The next step would be to set up an Azure Data Factory pipeline to move the data from the File Share to the Data Lake Store.