Accessing data in Python

Introduction

To build Datapane reports, you'll usually start by importing data into your Python environment. This is an overview of the main data sources and how to access them:
  1. Local files
  2. Databases
  3. APIs
  4. Data access libraries
The only major requirement is installing the pandas library:
Terminal

$ pip install pandas
# OR
$ conda install pandas

Jupyter

!pip install pandas

Local files

Often the data you need is stored in a local file on your computer. Depending on where you're running your Python environment, you can specify the filename as either a relative or an absolute path:
# Absolute path
file1 = "/Users/johnreid/Documents/my_project/data/example.csv"

# Relative path, assuming the current working directory is my_project
file2 = "./data/example.csv"

CSV files

CSVs are a popular choice for storing tabular data. Here we'll look at a population dataset from Our World in Data using pd.read_csv:
import pandas as pd

csv_file = "/Users/johnreid/Downloads/population-by-country.csv"
df_from_csv = pd.read_csv(csv_file)
df_from_csv.info()
After importing the data, it's helpful to run df.info() to understand how your data is structured, e.g. how many rows, columns and non-null values you have. Running that code gives us the following output:
If you keep getting a FileNotFoundError, try renaming your file to replace spaces with underscores, e.g. "Financial Sample.xlsx" becomes "Financial_Sample.xlsx".
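You can also check that Python actually sees the file before loading it. Here's a minimal sketch using pathlib (the path below is just an example):

from pathlib import Path

csv_path = Path("/Users/johnreid/Downloads/population-by-country.csv")
print(csv_path.exists())  # False means the path or filename is wrong
print(Path.cwd())         # The directory that relative paths are resolved against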

Excel files

You need to be a bit more cautious with Excel files, because they may contain more than one sheet of data and complex visual formatting e.g. extra header rows. Otherwise the syntax is pretty similar - here's a financial data example:
import pandas as pd

excel_file = "/Users/johnreid/Downloads/Financial_Sample.xlsx"
df_from_excel = pd.read_excel(excel_file, sheet_name="Sheet1")
df_from_excel.info()
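If your workbook contains several sheets or decorative rows above the real header, pandas can handle that too. A short sketch - the sheet name and number of rows to skip are illustrative:

import pandas as pd

excel_file = "/Users/johnreid/Downloads/Financial_Sample.xlsx"

# sheet_name=None loads every sheet into a dict of {sheet_name: dataframe}
all_sheets = pd.read_excel(excel_file, sheet_name=None)
print(all_sheets.keys())

# skiprows skips extra formatting rows above the real header (here we assume 2)
df_clean = pd.read_excel(excel_file, sheet_name="Sheet1", skiprows=2)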

Text

Text files often need more data processing - start by looking at how the data is stored and how you'd like to represent it in Python. From there, you can write code to transform textual input into a dataframe. Let's use a shopping list example, with each line containing an item and a quantity:
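For example, shopping_list.txt might contain something like this (the contents here are just illustrative):

apples 3
milk 1
bread 2
eggs 12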
To convert that to a dataframe, you can try the following:
import pandas as pd

shopping_list = "/Users/johnreid/Downloads/shopping_list.txt"

results = []

with open(shopping_list) as f:
    line = f.readline()
    while line:
        results.append(line.strip().split(" "))
        line = f.readline()

df_from_textfile = pd.DataFrame(results, columns=["Item", "Quantity"])
We read the lines one by one, strip extra whitespace and split each line into two parts. When we create the dataframe, we also assign column names.
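For a simple two-column file like this, you could also let pandas do the parsing directly - a sketch, assuming the file really is space-separated with no header row:

import pandas as pd

df_from_textfile = pd.read_csv(
    "/Users/johnreid/Downloads/shopping_list.txt",
    sep=" ",                     # columns are separated by a single space
    header=None,                 # the file has no header row
    names=["Item", "Quantity"],  # so we supply column names ourselves
)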

Multiple files / folders

Let's combine a couple of things that we've learned to extract data from the BBC Sport text dataset.
We have 5 subfolders, each with around 100 files. Each file starts with a headline, followed by the body of the article. Our goal will be to combine all these files into a single dataframe with 'Title', 'Subtitle', 'Body' and 'Genre' columns. The glob library comes in really handy:
import glob
import pandas as pd

base_path = "/Users/johnreid/Downloads/bbcsport/"
genres = ["athletics", "cricket", "football", "rugby", "tennis"]

def read_and_split_file(filename):
    with open(filename, 'r', encoding="latin-1") as f:
        lines = f.readlines()               # Get lines as a list of strings
    lines = list(map(str.strip, lines))     # Remove \n characters
    lines = list(filter(None, lines))       # Remove empty strings
    return lines

def get_df_from_genre(path, genre):
    files = glob.glob(path + genre + "/*.txt")
    titles = []
    subtitles = []
    bodies = []

    for f in files:
        lines = read_and_split_file(f)
        titles.append(lines[0])             # First line is the title
        subtitles.append(lines[1])          # Second line is the subtitle
        bodies.append(' '.join(lines[2:]))  # Combine all the rest

    return pd.DataFrame({
        'genre': genre,
        'title': titles,
        'subtitle': subtitles,
        'body': bodies
    })

final_df = pd.concat([get_df_from_genre(base_path, g) for g in genres])
final_df
Note that you can concatenate multiple dataframes together using pd.concat.
Running that code gives us the following output:

Databases

Most organizations store their business-critical data in a relational database like Postgres or MySQL, and you’ll need to know Structured Query Language (SQL) to access or update the data stored there.

SQLite

SQLite is an embedded database that is stored as a single file, so it's a great place to start testing out queries. Here we'll show an example of connecting to a SQLite file of the Chinook database:
import pandas as pd
import sqlite3 as sql

conn = sql.connect('/Users/johnreid/Downloads/chinook.db')

# First pattern - turn query directly into dataframe:
df1 = pd.read_sql_query("SELECT * FROM invoice", conn)

# Second pattern - get row-level data, but no column names
cur = conn.cursor()
results = cur.execute("SELECT * FROM invoice LIMIT 5").fetchall()
df2 = pd.DataFrame(results)
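With the second pattern the dataframe ends up with numbered columns. One way to recover the names is from the cursor's description attribute - a small sketch:

# cur.description holds one tuple per column; the first element is the column name
df2.columns = [col[0] for col in cur.description]
df2.head()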

Remote databases

Connecting to a remote database like Postgres, Redshift, or SQLServer uses mostly the same syntax but requires access credentials. For security reasons, it's best to store these credentials in a config file and load them into your Python script. You can create a separate .py file like this:
config.py

host = "localhost"
database = "suppliers"
user = "postgres"
password = "SecurePas$1"
and then import it into your Python script as follows (you'll also need the psycopg2 library):
import psycopg2
import config

conn = psycopg2.connect(
    host=config.host,
    database=config.database,
    user=config.user,
    password=config.password)

## Run queries to your heart's delight!
Make sure to keep your config.py file safe and don't upload it anywhere public - add it to your .gitignore so it doesn't get included in git commits.
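An alternative is to keep credentials out of files entirely and read them from environment variables - a sketch, assuming you've exported PGHOST, PGDATABASE, PGUSER and PGPASSWORD in your shell:

import os
import psycopg2

# These environment variable names are just an example - use whatever your team agrees on
conn = psycopg2.connect(
    host=os.environ["PGHOST"],
    database=os.environ["PGDATABASE"],
    user=os.environ["PGUSER"],
    password=os.environ["PGPASSWORD"])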

SQLAlchemy

If you want a more ‘pythonic’ way of querying a database, try the SQLAlchemy library, which is an Object-Relational-Mapper. It’s typically used for applications so that developers don’t have to write pure SQL to update their database, but you can use it for querying data too!
Here’s an example using the same Chinook music store database:
import sqlalchemy as db
import pandas as pd

engine = db.create_engine('sqlite:///chinook.db')
connection = engine.connect()
metadata = db.MetaData()
invoice = db.Table('invoice', metadata, autoload=True, autoload_with=engine)

# Get the first 10 invoices from the USA
query = (
    db.select([invoice])
    .filter_by(billing_country='USA')
    .limit(10)
)

df = pd.read_sql(query, engine)
In this code we connect to the database, then set up some tables & metadata in SQLAlchemy. Once that's defined, we can write a query in a more 'pythonic' way and read the results directly into a pandas dataframe. Running that code gives the following output:
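If you'd rather keep writing plain SQL, you can also pass a query string straight to pd.read_sql with the same engine - a minimal sketch:

import pandas as pd

df = pd.read_sql(
    "SELECT * FROM invoice WHERE billing_country = 'USA' LIMIT 10",
    engine,  # the SQLAlchemy engine created above
)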

APIs

Sometimes you'll need to access data from a particular platform your company uses, like Hubspot, Twitter or Trello. These platforms often have a public API that you can pull data from, directly inside your Python environment.
The basic idea is you send a request (which may include query parameters and access credentials) to an endpoint. That endpoint will return a response code plus the data you asked for (hopefully). The most common response codes are:
  • 200: Everything went okay, and the result has been returned.
  • 301: The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
  • 400: The server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
  • 403: The resource you’re trying to access is forbidden: you don’t have the right permissions to see it.
  • 404: The resource you tried to access wasn’t found on the server.
  • 503: The server is not ready to handle the request.
You'll need to look at the API documentation to understand what data fields are available. The data will usually be returned in JSON format, which allows for deeply-nested data. Let's do a minimal example using the OpenNotify API, which tracks all the people currently in space:
import requests
import pandas as pd

response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)
print(response.json())

res = pd.DataFrame(response.json()["people"])
res.head()
Running that code gives us the following output:
From here, try including query parameters or access credentials for your favourite API!
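As a general pattern, query parameters go in params and credentials usually go in headers - the endpoint and token below are purely hypothetical:

import requests

response = requests.get(
    "https://api.example.com/v1/reports",                 # hypothetical endpoint
    params={"start_date": "2022-01-01", "limit": 100},    # query parameters
    headers={"Authorization": "Bearer YOUR_API_TOKEN"},   # hypothetical access token
)
print(response.status_code)
data = response.json()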
If you don't want to deal with JSON you can try searching for a Python library for that API - these are usually open-source and maintained by the company or third parties.

Data Access Libraries

Pandas_datareader

Pandas_datareader is a great way to pull data from the internet into your Python environment. It is particularly suited to financial data, but also has some World Bank datasources. To get Zoom's daily share price over the past few years, try the following:
!pip install pandas_datareader
from pandas_datareader import data
import datetime as dt

# Get Zoom's daily share price since the start of 2019
zm = data.DataReader("ZM",
                     start='2019-1-1',
                     end=dt.datetime.today(),
                     data_source='yahoo').reset_index()

zm.head()
Running that code gives us the following output:

DataCommons

Datacommons is a project by Google providing access to standardized and cleaned public datasets. The underlying data is represented in a graph format, making it really easy to query and join data from many different data sources, e.g. the US Census, World Bank, Wikipedia, Centers for Disease Control and more. Here's a basic example:
!pip install datacommons datacommons_pandas --upgrade --quiet

import datacommons, datacommons_pandas as dc
import pandas as pd

city_dcids = dc.get_property_values(["CDC500_City"], "member", limit=500)["CDC500_City"]

cdc500_df = dc.build_multivariate_dataframe(
    city_dcids,
    [
        "Percent_Person_Obesity",  # Prevalence of obesity from CDC
        "Median_Income_Person",
        "Median_Age_Person",
        "UnemploymentRate_Person",  # Unemployment rate from BLS
        "Count_Person_BelowPovertyLevelInThePast12Months",  # Persons below the poverty line from Census
        "Count_Person",  # Total population from Census
    ],
)

cdc500_df.info()
Running that code gives us the following:

PyTrends

PyTrends is an unofficial but useful library for querying Google Trends data - here's a simple example:
import pandas as pd
from pytrends.request import TrendReq

pytrends = TrendReq()
keywords = ["oat milk", "soy milk", "almond milk"]
pytrends.build_payload(keywords, cat=0, geo='', gprop='')  # Get data from the last 5 years
top_queries = pytrends.interest_over_time()[keywords]

top_queries.head()
Running that code gives us the following output:

Kaggle

Kaggle is a data science community that hosts a lot of datasets and competitions for learning Python. You can download some of these datasets to play around with through their command-line interface (note: you'll need to sign up for a Kaggle account). For example, say we want to download some Zillow economics data:
Terminal

$ pip install kaggle
$ export KAGGLE_USERNAME=datadinosaur
$ export KAGGLE_KEY=xxxxxxxxxxxxxx
$ kaggle datasets download zillow/zecon
$ unzip zecon.zip
This downloads a zipped file of the datasets and then uncompresses it. From there, you can open the files locally with pandas:
import pandas as pd

csv_file = "/Users/johnreid/Downloads/Zip_time_series.csv"
df_from_csv = pd.read_csv(csv_file)
df_from_csv.info()
To learn more, check out the Kaggle API documentation.