Create an auto-updating dataset on Kaggle with Deepnote
Never leave a stale dataset behind: keep your Kaggle data always up to date.
Kaggle has become the de facto place to find data for data science projects. Even though the internet is filled with data, Kaggle is great because the data is often in an easy-to-use format. Unfortunately, as the years go by, Kaggle is suffering from the same problem as the rest of the internet.
Datasets get old, and as they age, their usefulness quickly declines. What’s worse, not only do they become almost useless, they actively clog up the system for everyone else. Scrolling through datasets takes more and more time for everyone as a result. To do my part to help the internet, in this tutorial, I want to show you how to create your very own auto-updating dataset on Kaggle for free using Deepnote.
If you don’t know what Deepnote is, it’s a collaborative data notebook that is Jupyter compatible. They have a generous free tier that allows you to perform analysis, publish public notebooks, and schedule them to run automatically. This last part, scheduling, is what we will use for this tutorial.
In this tutorial, we will set up our Deepnote environment to work with Kaggle, pull and clean some data from the internet, and then schedule the data to auto-update using Deepnote and the Kaggle library. We will use average price data from the Bureau of Labor Statistics (BLS). This data contains average prices for cities and regions for both food and non-food items.
Notably, the BLS already has a Kaggle account and some datasets. Unfortunately, they haven’t been updated since 2017! So let’s create our own Kaggle dataset that is always fresh to help out our fellow data scientists.
To get started, you should already have a Deepnote and Kaggle account set up and ready to go.
Step 1: Set Up Kaggle API Keys
We will be using Kaggle’s API to create and update our dataset. This API requires a token, which we can generate on the Kaggle account page. Head to https://kaggle.com/{your_user_name}/account and find the API section, then click on Create New API Token.
This will download a JSON file (called kaggle.json). You can open this file in a simple text editor, and you will see that it contains your username and API key. There are two ways to use this:
- Place the JSON file at ~/.kaggle/kaggle.json.
- Set environment variables for your username and API key.
Using the JSON file is fine if you work on your local machine, but I greatly prefer the environment variable method because it offers a few advantages.
Using environment variables, you don’t risk accidentally exposing your API key to others if you share your project. Also, environment variables can be shared across Deepnote projects via integrations. This allows you to set your environment variables once and reuse them across several projects.
Create Deepnote Integration
Let’s create an environment variable integration in Deepnote. First, click the Integrations button on the left pane of your Deepnote workspace. Then click on the Environment variable integration button. Add the following details:
- Integration Name: Any logical name, something like kaggle_api_keys.
- Key/Value #1: KAGGLE_USERNAME, your Kaggle username.
- Key/Value #2: KAGGLE_KEY, the API key from the JSON file you downloaded.
Connect Integration to Project
Now that our integration is ready, we can connect it to our project so we can use the Kaggle Python package. Create a new project in Deepnote, then click on the Integrations button on the right pane. You should see the integration you created earlier. Click Connect and your environment variables are auto-magically connected to your project.
Now that our environment variables are loaded into our project, we can import both the pandas and kaggle packages.
Note: If you don’t already have the kaggle package installed, you can install it in Deepnote by running !pip install kaggle in a cell. Once installed, you will be prompted to add the package to the requirements.txt file. Click on that or add it yourself. Either way, Deepnote will ensure that all packages in your requirements.txt file are installed before your code is run.
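Step 2: Pull and Clean the Data
The original data-loading cell isn’t reproduced here, but a minimal sketch could look like the following. The flat-file URL and the User-Agent workaround are my assumptions: the BLS publishes its average price (AP) series as tab-separated files under https://download.bls.gov/pub/time.series/ap/, and the server may reject requests that don’t send a browser-like User-Agent header.
import pandas as pd
# Assumed location of the BLS average price (AP) flat files
BLS_BASE = "https://download.bls.gov/pub/time.series/ap/"
# The BLS server may refuse requests without a User-Agent header
OPTS = {"User-Agent": "Mozilla/5.0 (bls-average-prices refresh)"}
# Price observations: series_id, year, period, value, footnote_codes
df = pd.read_csv(BLS_BASE + "ap.data.0.Current", sep="\t", storage_options=OPTS)
print(df.shape)
df.head()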
We have 189 thousand rows of prices. These range from January 1995 to July 2022.
For some reason the column names contain extra spaces, so I will reset them to strip that whitespace out. This will ensure our tables get merged properly later.
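A one-liner handles that, assuming the dataframe from the sketch above is called df:
# Remove stray whitespace from the column names so later merges line up
df.columns = df.columns.str.strip()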
The important column in this dataset is series_id. It tells us the item and area code for the given row. The BLS helpfully gives a breakdown of the series_id column, so we can understand how it is structured.
The series_id (APU0000701111) can be broken out into:
- survey abbreviation — AP
- seasonal(code) — U
- area_code — 0000
- item_code — 701111
Given this info, we need to break down our data by area and item. After scrolling through the data, it seems that the seasonal code doesn’t change in our data, so we can ignore it. Let’s pull down the item and area tables next.
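Again, the original cell isn’t shown here; this sketch assumes the same flat-file directory as above, where ap.item and ap.area are the lookup files the BLS publishes alongside the price data:
# Lookup tables mapping item_code -> item_name and area_code -> area_name
item = pd.read_csv(BLS_BASE + "ap.item", sep="\t", storage_options=OPTS)
area = pd.read_csv(BLS_BASE + "ap.area", sep="\t", storage_options=OPTS)
# These files carry the same stray whitespace in their column names
item.columns = item.columns.str.strip()
area.columns = area.columns.str.strip()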
We can see that our item table is a detailed item name of the product corresponding to the item code. Items include both food and non-food items like electricity and gasoline.
Our area table includes cities, regions, and region/size class groupings.
Now we will merge the data into a single table that can be added as a Kaggle Dataset. We will do the merging in two different ways. Functionally they are the same, but I wanted to show both here as a kind of mini-tutorial.
The first uses more typical pandas code, where we perform an operation and then save the dataframe. The second chains pandas methods together. I'm not new to chaining, but Matt Harrison recently inspired me to take chaining to a new level to write cleaner, more readable pandas code.
Typical Pandas Code
Let’s show how we would typically do this in pandas. I'll make a copy of the dataframe so we can work on it without changing the original.
We will split the area and item codes out of the series_id. These have predictable lengths, so we can simply use slicing. If an item code is shorter than 6 characters (some of them are), we can simply strip out the extra spaces that would be left at the end.
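Here is a sketch of that split, working on a copy called prices; the slice positions follow the series_id breakdown above:
# Work on a copy so the raw dataframe stays untouched
prices = df.copy()
# APU0000701111 -> area_code is characters 3-6, item_code is characters 7-12
prices["area_code"] = prices["series_id"].str[3:7]
prices["item_code"] = prices["series_id"].str[7:13].str.strip()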
Now we can merge our dataframes to get text descriptions of these items and area codes.
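A sketch of the merges, assuming the lookup tables from earlier are named item and area and expose item_code/item_name and area_code/area_name columns:
# Attach the human-readable names for each area and item code
prices = prices.merge(area[["area_code", "area_name"]], on="area_code", how="left")
prices = prices.merge(item[["item_code", "item_name"]], on="item_code", how="left")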
We have an extra column (footnote_codes). It's all NaNs, so we can remove it.
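For example:
# footnote_codes is entirely NaN, so drop it
prices = prices.drop(columns=["footnote_codes"])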
Chained Methods
The chained code below is one line longer than the code above because of the closing parenthesis. It uses indentation to highlight the methods, with the goal of making the code more readable.
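The original chained cell isn’t reproduced here, but an equivalent sketch built from the same steps (and the same assumed names) would look something like this:
prices = (
    df
    .assign(
        # Same slice positions as the step-by-step version above
        area_code=lambda d: d["series_id"].str[3:7],
        item_code=lambda d: d["series_id"].str[7:13].str.strip(),
    )
    .merge(area[["area_code", "area_name"]], on="area_code", how="left")
    .merge(item[["item_code", "item_name"]], on="item_code", how="left")
    .drop(columns=["footnote_codes"])
)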
To write code such as this, you go line by line and build the chain, executing the cell each time. I’m still not totally sold on this, though, because you do miss the ability to have markdown next to each chunk of code. It’s certainly less verbose, though.
We can save our dataframe to a CSV file, and now our data is ready to be added to Kaggle!
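For example:
# Write the merged table out so the Kaggle CLI can pick it up
prices.to_csv("average_prices.csv", index=False)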
Step 3: Add Dataset to Kaggle
The Kaggle API allows you to interact with competitions, datasets, and notebooks. For this tutorial, we will only be working with datasets. The documentation for the Kaggle API is here.
The Kaggle package is designed to work via the CLI. Luckily, we can use it in a notebook by simply adding a ! in front of the command.
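For example, we can search for existing datasets; the search term here is just an illustration:
!kaggle datasets list -s "bureau of labor statistics"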
We can see that the Bureau of Labor Statistics has three datasets on Kaggle already. Unfortunately, they haven’t been updated since 2017! This should be further motivation to ensure that any dataset you upload is refreshed regularly. An old dataset is only useful for solving yesterday’s problems.
The kaggle library expects our dataset to be in its own folder, so let's create a folder and move our CSV file over.
!mkdir bls-average-prices
!cp average_prices.csv bls-average-prices
You should now have a folder in your project files called bls-average-prices with the average_prices.csv file inside.
We will initialize our dataset metadata file with the init command. This will generate a file in our dataset folder called dataset-metadata.json.
!kaggle datasets init -p bls-average-prices
You can now open the JSON file and add the title and id that you would like. The default license is CC0-1.0, which is a public domain license. As our data is in the public domain, this works fine. You can find many more items that can be added here.
Your JSON file should at minimum look something like this.
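Something along these lines, with a hypothetical username standing in for your own:
{
  "title": "BLS Average Prices",
  "id": "your-username/bls-average-prices",
  "licenses": [
    {
      "name": "CC0-1.0"
    }
  ]
}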
Now we can create our dataset with the create command. This will read the JSON file and create the dataset from the data in our folder, using the title and id you put in.
!kaggle datasets create -p bls-average-prices
If everything works right, you should see something like the output above. The library will give you a private URL where you can add additional metadata. To create a high-quality dataset on Kaggle, much more metadata is needed (source, description, etc.), so it’s a good idea to take the time here to fill this in.
Also, keep in mind that your dataset isn’t public just yet. Once you are happy with the added details, you can make it public at the web interface above. I went ahead and did this, and you can see the result here.
Create a new dataset version
The idea is to update this dataset regularly (so we don’t have old datasets clogging up the internet pipes). To do this, we will run our notebook on a schedule. Each notebook run will pull down the most recent data and create a new dataset version on Kaggle.
We don’t want to create a brand-new dataset each time, so comment out the init and create commands above by adding a # character at the beginning of those lines.
To create a new dataset version, we use the version command and give an update message.
!kaggle datasets version -p bls-average-prices -m "Automatic Update"
There you have it. Now we have all the code needed to update our dataset on whatever schedule we want.
Step 4: Schedule Notebook
Scheduling a notebook in Deepnote is very simple. At the top of the notebook, you will see the Run notebook button with a down arrow. Click on the down arrow and select Schedule notebook. Here you can set the frequency of updates, and also whether Deepnote should send an email if the run succeeds or fails.
Our data is updated monthly, but since Deepnote does not currently allow a monthly schedule, we can set it to run weekly.
Conclusion
Now you can sit back and enjoy the wonder of auto-updating datasets. You can feel the satisfaction of knowing that you aren’t contributing another dataset that will be old in a few months. After all, we as data scientists should be solving today’s problems, not yesterday’s.
Thanks for reading. This post originally appeared on Datafantic. If you want to receive updates via email whenever I post something new, sign up for the Datafantic newsletter.
If you like Medium and you would like to directly support my articles you can use my referral link to sign up to become a Medium member.