Beginner’s Data Engineering: The Simple-Spooky Pipeline

Rupert Adams
5 min read · Oct 26, 2019

It’s a dark and stormy night, the wind is howling and the branch from that tree you should really cut back is tapping at your window; your only source of light is the dim glow of your laptop as you scour the web for something both terrifying and reasonably informative. Suddenly the words flash across your screen, and begin to haunt your weary mind:

Simple-Spooky Pipeline…

What is a Simple-Spooky Pipeline? Why, it is the simplest, spookiest introduction to data engineering I could come up with of course! Having gone through various blogs on the subject I’ve found there to be a distinct lack of fun and informative posts to help people get into data engineering and analytics. So here it is! Written in Python! My simple, yet spooky pipeline tutorial. This post will teach you how to:

  • Scrape a webpage for a table
  • Download that table into a CSV file
  • Clean that data using Dagster, a data pipeline package for Python, and export the result to a new file, ready to be read into a notebook

So are you ready? Do you have your IDE set up? Are you sure that noise behind you wasn’t an unhinged murderer who has just escaped an asylum you didn’t know was nearby? Good. Then let us begin…The Simple-Spooky Pipeline!

The Simple-Spooky Scraper

The first thing we’re going to need is a data source. Luckily for us the Friday the 13th Wiki has a convenient set of tables with all the teenagers who have been unfortunate enough to visit Camp Crystal Lake. So, inside a src folder:

src/spooky_scraper.py

from requests import get
from requests.exceptions import RequestException
from contextlib import closing


def log_error(e):
    """
    It is always a good idea to log errors.
    This function just prints them, but you can
    make it do anything.
    """
    print(e)


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and content_type.find('html') > -1)


def spooky_get(url):
    """
    Fetches the content at url, returning the raw HTML if the
    response looks sane, or None otherwise.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None
    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None

With those two helper functions handling error logging and response checking, the main function, ‘spooky_get’, can fetch a webpage for you, so long as you give it the URL as a string. Now we want to scrape that webpage to find our devilish data, which in this case is the first table, detailing all the deaths from the first Friday the 13th film. This is achieved through an amazing Python package called BeautifulSoup. We need to add the code below to our scraper:

src/spooky_scraper.py

from bs4 import BeautifulSoup
import csv

if __name__ == '__main__':
    raw_html = spooky_get('https://fridaythe13th.fandom.com/wiki/List_of_deaths_in_the_Friday_the_13th_films')
    html = BeautifulSoup(raw_html, 'html.parser')
    table = html.find("table")

    # Collect the text of every cell, row by row
    output_rows = []
    for table_row in table.findAll('tr'):
        columns = table_row.findAll('td')
        output_row = []
        for column in columns:
            output_row.append(column.text)
        output_rows.append(output_row)

    # Write the rows into our first CSV file
    with open('happy_campers.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(output_rows)

    print('Boo!')

if __name__ == '__main__': is a common Python idiom; the code beneath it only runs when the file is executed directly from the command line:
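The terminal screenshot hasn't survived the grave, but the incantation is presumably nothing scarier than running the script with Python and letting it print its parting ‘Boo!’ (a sketch, assuming you run it from the project root):

python src/spooky_scraper.py
Boo!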

ARGH!

Running this will create our first file: happy_campers.csv. Hurray! We’ve just scraped the web and downloaded data! Terrifying data!

The Simple-Spooky Pipeline

Now that we have the file, let’s read it into a notebook:
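The original notebook screenshot isn’t reproduced here, but the cell behind it is presumably little more than a raw pandas read (a minimal sketch, assuming the CSV sits next to the notebook):

import pandas as pd

# Peek at the freshly scraped file
campers = pd.read_csv('happy_campers.csv')
campers.head()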

Oh god! It’s HIDEOUS!

Well, it looks like a table of deaths alright, but what is a ‘knife\n’? And what on earth does the ‘#’ column show us? This data is filthy (as most data is!), so we need to find a way to clean it up.

Enter Dagster, a pipeline package for Python. Dagster has lots of awesome features for data engineering that are worth a look, and the platform is expanding rapidly. For this project, however, we will simply make use of its solid and pipeline features:

src/spooky_pipeline.py

import pandas as pd
from dagster import solid, pipeline


@solid
def clean_code(_, file):
    filthy_file = pd.read_csv(file)
    # Scrub out the stray newlines and leftover reference numbers
    filthy_file.replace(['\n', '1', '2', '3'], ' ', inplace=True, regex=True)
    filthy_file.columns = filthy_file.columns.str.strip()
    filthy_file.set_index('Name', inplace=True)
    # Drop the columns that tell us nothing useful
    del filthy_file['#']
    del filthy_file['Notes']
    filthy_file.to_csv('clean_campers.csv')


@pipeline
def simple_spooky_pipeline():
    clean_code()

Using the pandas library to manipulate the data, our solid, ‘clean_code’, opens our CSV file and cleans various aspects of the data: removing the ‘\n’s, deleting the useless columns and setting the index of the table to ‘Name’ (‘inplace=True’ is used to change the existing data frame rather than making a new one). The final piece of this creepy puzzle is our instructions file. The Dagster pipeline needs to know what configuration to run with, in this case the value for the ‘file’ input of our solid. This may not seem important with only one table, one solid and one pipeline, but as more is added to a data project, it makes sense to separate these responsibilities into different files. So the last file, a YML file, will look like this:

spooky_instructions.yml

solids:
  clean_code:
    inputs:
      file:
        value: 'happy_campers.csv'

Now we just have to run Dagster and watch the magic happen:
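The exact command depends on your Dagster version (the dagster CLI can execute a pipeline given the pipeline file and our YAML instructions). As a rough, version-dependent sketch, you can also kick it off from Python, assuming this runner sits next to spooky_pipeline.py and the YAML file; note that the config argument has been renamed across Dagster releases, from environment_dict in older versions to run_config in newer ones:

import yaml
from dagster import execute_pipeline
from spooky_pipeline import simple_spooky_pipeline

# Load our instructions file and hand it to Dagster
with open('spooky_instructions.yml') as f:
    instructions = yaml.safe_load(f)

# In newer Dagster releases this argument is called run_config instead
execute_pipeline(simple_spooky_pipeline, environment_dict=instructions)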

That’s a mouthful, and possibly a little too spooky.

Running this executes the pipeline: the data frame read from the first file is cleaned and saved to a new file called ‘clean_campers.csv’. The ‘clean’ refers to the data, not the campers; I imagine the campers are mostly covered in blood and viscera and are anything but clean. So let us run that code in the notebook again with the new file and see what comes out:
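In notebook terms, that is presumably the same read as before, pointed at the new file (again a sketch, assuming the cleaned CSV sits next to the notebook):

# Same idea as before, but reading the cleaned file with 'Name' as the index
clean_campers = pd.read_csv('clean_campers.csv', index_col='Name')
clean_campers.head()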

Oh God! It’s BEAUTIFUL!

And there we have it: a simple, spooky, end-to-end pipeline. You can now find data on the web, download it, clean it and read it. Of course this isn’t perfect. There are still odd characters like the ‘[]’ and we haven’t scraped data from the other tables on the page, but it is a good start. Maybe you could clean the data even further? Scrape even more data? Manipulate the data in the notebook to uncover its secrets? The choice is yours! You better hurry though, Halloween is just around the corner and creating simple-spooky pipelines might be frowned upon during Christmas. It may become difficult to explain why you’ve been working with such gruesome data, making small talk a little awkward. AWKWARD I TELL YOU! MWAHAHAHA.
