(2023-10) Update: see this post for a version that returns pydantic models
Let's write a simple function that returns data in a sane and organized way from an SQLite3 database using only the Python 3 standard library. We will be working in Python 3.8 for this example and will be using some very basic SQL statements. If you are new to SQL, I highly recommend this course: https://www.executeprogram.com/courses/sql
I also wrote a gist with an example program to test and demonstrate this concept.
Problem
Say we have a database containing the results of an experiment you just ran, with the following fields (SQLite3 datatypes):
- fpath: TEXT (Filepath with the original, un-analyzed data)
- n_measure: INTEGER (Replicate number)
- treatment: TEXT (Treatment group)
- amplitude: REAL (The measure we are interested in, amplitude of whatever)
You are planning on analyzing and making a plot of the data using Python or R or whatever you use, and you need to get the data out of the SQLite3 database. How do you do it?
Making the sample database
First, let's create a simple database to work with:
import sqlite3
SAMPLE_DB = "sample_db.db" # path to our example database
# Schema defining the table holding our data
SCHEMA = """CREATE TABLE data (fpath TEXT, n_measure INTEGER, treatment TEXT,
amplitude REAL)"""
# make the database by connecting
temp_db = sqlite3.connect(SAMPLE_DB)
# Create the table, commit it, and close the connection
temp_db.execute(SCHEMA)
temp_db.commit()
temp_db.close()
The set of commands above will create an SQLite3 database with a table called data to hold our data. For our example, we will add some sample data to our database.
test_data = [
    {
        "fpath": "path/to/file/one.dat",
        "n_measure": 1,
        "treatment": "Control",
        "amplitude": 50.5,
    },
    {
        "fpath": "path/to/file/two.dat",
        "n_measure": 2,
        "treatment": "Control",
        "amplitude": 76.5,
    },
    {
        "fpath": "path/to/file/three.dat",
        "n_measure": 1,
        "treatment": "Experimental",
        "amplitude": 5.5,
    },
]
We can add the data to our database in a simple loop using a parameterized SQL statement (this one uses named parameters, allowing you to insert directly from a dictionary):
temp_db = sqlite3.connect(SAMPLE_DB)  # re-connect to the sample database we just made
for item in test_data:
    temp_db.execute(
        "INSERT INTO data (fpath, n_measure, treatment, amplitude) VALUES(:fpath, :n_measure, :treatment, :amplitude)",
        item,
    )
    temp_db.commit()  # commit after each addition
    print(f"Added data {item['fpath']}")  # print a helpful message once added
temp_db.close()
# Added data path/to/file/one.dat
# Added data path/to/file/two.dat
# Added data path/to/file/three.dat
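As an aside, sqlite3 can also do this in a single call with executemany(), which runs the same statement once for each item in an iterable; named parameters work there too. Here is a minimal equivalent sketch (minus the per-file messages). Note it is an alternative to the loop above, so don't run both, or the rows will be inserted twice:
temp_db = sqlite3.connect(SAMPLE_DB)
# executemany runs the INSERT once per dict in test_data
temp_db.executemany(
    "INSERT INTO data (fpath, n_measure, treatment, amplitude) VALUES(:fpath, :n_measure, :treatment, :amplitude)",
    test_data,
)
temp_db.commit()  # a single commit covers the whole batch
temp_db.close()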
The data is now in our example database; how would we get it out?
Getting the data out, first attempt
Here is the simplest way to get all the data:
temp_db = sqlite3.connect(SAMPLE_DB)
data = temp_db.execute("SELECT * FROM data").fetchall()
print(data)
# [('path/to/file/one.dat', 1, 'Control', 50.5),
# ('path/to/file/two.dat', 2, 'Control', 76.5),
# ('path/to/file/three.dat', 1, 'Experimental', 5.5)]
temp_db.close()
That works, but it returns a list of tuples, without any information about which value is which. For a small database like this that's not a problem, but if we had a big set of returned columns, it could get confusing.
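As an aside, even with plain tuples you can recover the column names: the cursor's description attribute holds a 7-tuple for each column, and sqlite3 fills in the first element (the name):
temp_db = sqlite3.connect(SAMPLE_DB)
cursor = temp_db.execute("SELECT * FROM data")
# Only the first element of each 7-tuple is populated by sqlite3
column_names = [d[0] for d in cursor.description]
print(column_names)
# ['fpath', 'n_measure', 'treatment', 'amplitude']
temp_db.close()
That gives you the names, but you still have to pair them up with each row yourself. Let's explore an alternate method.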
Getting the data out, second attempt
SQLite3 contains a helper called a row_factory, and setting it on your connection object allows you to access values by either index or name (see the sqlite3 docs for more):
temp_db = sqlite3.connect(SAMPLE_DB)
temp_db.row_factory = sqlite3.Row # set row factory on the connection object
values = temp_db.execute("SELECT * FROM data WHERE treatment = 'Experimental'").fetchall()
temp_db.close()
fetchall() returns a list of rows, so let's select the first item (there is only one; remember, we only added one "Experimental" measurement):
inspect = values[0]
print(inspect)
# <sqlite3.Row object at 0x10d3f55b0>
This returns an sqlite3.Row object. We can now access the values in the object in two different ways. First, by index:
print(inspect[0])
# path/to/file/three.dat
print(inspect[2])
# Experimental
That's still not very useful. But the other way you can get the items out is by using the (case-insensitive) column name:
print(inspect["fpath"])
# path/to/file/three.dat
print(inspect["treatment"])
# Experimental
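That case-insensitivity is easy to check:
print(inspect["TREATMENT"])  # same column, upper-cased key
# Experimental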
That is looking more useful. It looks like we can treat the row like a dictionary, keyed by column name:
print(inspect.keys())
# ['fpath', 'n_measure', 'treatment', 'amplitude']
print(inspect.items()) # fail
# AttributeError: 'sqlite3.Row' object has no attribute 'items'
OK, so it is a kinda/sorta dictionary, but not an actual dictionary. Dictionaries are useful data structures in Python: they are easy to work with, easy to serialize, and can be quickly converted to a clear serialization format like JSON. Let's use the Python SQLite3 row_factory to extract the values into a dictionary. Remember, the query we executed above (SELECT * FROM data WHERE treatment = 'Experimental') returns only the data from the "Experimental" group (which is only one item). We can extract the values using a dictionary comprehension:
original = {k: inspect[k] for k in inspect.keys()}
print(original)
# {'fpath': 'path/to/file/three.dat',
# 'n_measure': 1,
# 'treatment': 'Experimental',
# 'amplitude': 5.5}
Dictionary comprehensions are similar to list comprehensions, but they build a dictionary. In that one-line statement, we iterate through the keys in our sqlite3.Row and use them to create a dictionary with the key (k) and the value inspect[k] for every k in the sqlite3.Row. List/dictionary comprehensions are nice because they are succinct, fast, and avoid mutable "accumulator" variables.
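A quick payoff of having a real dictionary: since the values are ordinary strings and numbers, it serializes directly with the standard library's json module:
import json
print(json.dumps(original, indent=2))
# {
#   "fpath": "path/to/file/three.dat",
#   "n_measure": 1,
#   "treatment": "Experimental",
#   "amplitude": 5.5
# }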
That method looks like it worked! In this case we only unpacked one item. Let's write one to unpack an arbitrary number of items, like when we SELECT *.
temp_db = sqlite3.connect(SAMPLE_DB)
temp_db.row_factory = sqlite3.Row
values = temp_db.execute("SELECT * FROM data").fetchall()
list_accumulator = []
for item in values:
    list_accumulator.append({k: item[k] for k in item.keys()})
print(list_accumulator)
# [{'fpath': 'path/to/file/one.dat',
# 'n_measure': 1,
# 'treatment': 'Control',
# 'amplitude': 50.5},
# {'fpath': 'path/to/file/two.dat',
# 'n_measure': 2,
# 'treatment': 'Control',
# 'amplitude': 76.5},
# {'fpath': 'path/to/file/three.dat',
# 'n_measure': 1,
# 'treatment': 'Experimental',
# 'amplitude': 5.5}]
temp_db.close()
Awesome! I mentioned list comprehensions above; we could actually make this loop more succinct, and remove the need for the mutable list_accumulator variable, by combining the dictionary comprehension with a list comprehension:
list_comp_version = [{k: item[k] for k in item.keys()} for item in values]
print(list_comp_version)
# [{'fpath': 'path/to/file/one.dat',
# 'n_measure': 1,
# 'treatment': 'Control',
# 'amplitude': 50.5},
# {'fpath': 'path/to/file/two.dat',
# 'n_measure': 2,
# 'treatment': 'Control',
# 'amplitude': 76.5},
# {'fpath': 'path/to/file/three.dat',
# 'n_measure': 1,
# 'treatment': 'Experimental',
# 'amplitude': 5.5}]
We do this at the risk of being less clear ('too clever'), but in this case I think it is worth it to get rid of loops and mutable variables.
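One caveat before we generalize: fetchall() loads every row into memory at once, which is fine here but can hurt with a large table. The cursor itself is iterable, so a generator expression can build the same dictionaries lazily, one row at a time. A minimal sketch along the same lines:
temp_db = sqlite3.connect(SAMPLE_DB)
temp_db.row_factory = sqlite3.Row
cursor = temp_db.execute("SELECT * FROM data")
# The cursor yields one sqlite3.Row at a time; each row is converted
# to a dict only when the generator is consumed
lazy_dicts = ({k: row[k] for k in row.keys()} for row in cursor)
for row_dict in lazy_dicts:
    print(row_dict)
temp_db.close()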
Writing a function
We have solved our original problem, but the cool thing is that this statement is general and can easily be re-used. Rather than re-writing it every time, we can generalize this into a function that we can import and use later:
def sql_data_to_list_of_dicts(path_to_db, select_query):
    """Returns data from an SQL query as a list of dicts."""
    con = sqlite3.connect(path_to_db)
    con.row_factory = sqlite3.Row
    try:
        things = con.execute(select_query).fetchall()
        unpacked = [{k: item[k] for k in item.keys()} for item in things]
        return unpacked
    except Exception as e:
        print(f"Failed to execute. Query: {select_query}\n with error:\n{e}")
        return []
    finally:
        con.close()
I've added a few error/exception handlers (and opened the connection before the try block, so the finally clause can always close it), but this is very similar to our original code. Here is how to use it with the database we already made:
QUERY = "SELECT * FROM data"
returned_data = sql_data_to_list_of_dicts(SAMPLE_DB, QUERY)
print(returned_data)
# [{'fpath': 'path/to/file/one.dat',
# 'n_measure': 1,
# 'treatment': 'Control',
# 'amplitude': 50.5},
# {'fpath': 'path/to/file/two.dat',
# 'n_measure': 2,
# 'treatment': 'Control',
# 'amplitude': 76.5},
# {'fpath': 'path/to/file/three.dat',
# 'n_measure': 1,
# 'treatment': 'Experimental',
# 'amplitude': 5.5}]
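One caution: the function takes a finished SQL string, so it is tempting to splice runtime values in with string formatting. If any of those values come from user input, it is safer to let SQLite substitute them. Here is a sketch of a variant (the params argument is my addition, not part of the function above) that forwards parameters to execute():
def sql_query_to_list_of_dicts(path_to_db, select_query, params=()):
    """Like sql_data_to_list_of_dicts, but forwards query parameters."""
    con = sqlite3.connect(path_to_db)
    con.row_factory = sqlite3.Row
    try:
        rows = con.execute(select_query, params).fetchall()
        return [{k: row[k] for k in row.keys()} for row in rows]
    except Exception as e:
        print(f"Failed to execute. Query: {select_query}\n with error:\n{e}")
        return []
    finally:
        con.close()

# Usage: the ? placeholder keeps the value out of the SQL string
experimental = sql_query_to_list_of_dicts(
    SAMPLE_DB, "SELECT * FROM data WHERE treatment = ?", ("Experimental",)
)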
Wrapping up: use the standard library if you can
This is a nice way to work when programming: Solve your original problem, then think about how to generalize it into a function for re-use.
Sometimes the generalization is not worth the trouble, but in this case, we have a very useful function for serializing and returning data using only the Python standard library. Simplicity is essential for re-usable code. While pandas no doubt has great helper functions for similar tasks, if you don't have to take on a heavy dependency, you probably shouldn't. The Python standard library is powerful (and portable), and this will work anywhere Python 3 is available, with no pip install pandas or virtual environments needed.
You can run this example using this github gist.