Computational Social Science Post

Tips on R, NetLogo, and Python

DuckDB: A DBMS for Social Scientists Analyzing Large Datasets

Posted on Sep. 15, 2024



Analyzing big data in the social sciences is still not very common.
However, opportunities to analyze gigabytes of data are growing. The biggest obstacle will probably be insufficient memory.

The operating system, R, your web browser, and other applications all use your computer's memory. Even if you think you have enough memory to hold the data, you actually need considerably more to load, explore, and manipulate large datasets.

One solution to this problem is obvious: buy an expensive, high-performance computer. If your data gets updated and grows even bigger, you can simply buy a newer, more expensive machine and stop worrying about the problem.
However, only a few of us have that option.

A DBMS (database management system) is a reasonable and immediately available alternative to buying a new computer.

But learning SQL is a challenge of its own.
Besides picking up a new language, you may find yourself overwhelmed by software selection and even installation.
I tried to study MySQL and SQLite, got lost learning SQL, and eventually lost sight of why I wanted to use it in the first place. Until I discovered DuckDB, I had given up on SQL altogether.

DuckDB can be a good DBMS solution for social scientists who want to analyze large datasets. First of all, it is easy to set up and use: you can think of DuckDB as just another R/Python package. According to its developer, Hannes Mühleisen, DuckDB is R-friendly, since interactions with the R community shaped its overall design.

In this post, I am going to show how to use DuckDB in R and why it is a good solution.

Install DuckDB

install.packages("duckdb")

# In Python, install from the shell:
pip install duckdb

Done! DuckDB has no external dependencies.

Create DB and Table

library(duckdb)
library(tidyverse) # data manipulation

con_db <- dbConnect(duckdb(), dbdir = "./r_duck.db")

or

con_db <- dbConnect(duckdb())
dbConnect connects to an existing database and stores the connection in con_db. If the database does not exist yet, it creates it, along with the .db file (r_duck.db), at the location given by dbdir. If you only want a temporary, in-memory database, call dbConnect(duckdb()) without dbdir.
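To make the difference concrete, here is a minimal sketch contrasting the two kinds of connection (separate from the con_db connection used in the rest of this post; the file name example.db is just an illustration):

library(duckdb)

# File-backed database: tables persist on disk after you disconnect
con_file <- dbConnect(duckdb(), dbdir = "./example.db")

# Temporary in-memory database: everything disappears when the connection closes
con_temp <- dbConnect(duckdb())

# Close the connections when you are done
dbDisconnect(con_file, shutdown = TRUE)
dbDisconnect(con_temp, shutdown = TRUE)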

Loading Data with DuckDb and read_csv

# Load the CSV into a DuckDB table
system.time(duckdb_read_csv(con_db, 'duckread', 'Traffic_Violations.csv'))
dbListTables(con_db)

# Load the same CSV into R's memory with readr, for comparison
system.time(read_csv('Traffic_Violations.csv'))
The dataset in this post is Traffic_Violations.csv from https://data.gov/, which is about 800 MB in size, with 1,949,392 rows and 43 columns. The time required for duckdb_read_csv is 4.23 seconds, whereas read_csv requires 9.44 seconds.
duckdb_read_csv loads the CSV file and creates a table ('duckread') in the database (con_db). You can list the tables, including 'duckread', with dbListTables(con_db).
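As a quick sanity check, you can run small SQL queries against the new table without pulling the full dataset into R; a minimal sketch:

# Count the rows inside DuckDB; only the single result value is returned to R
dbGetQuery(con_db, "SELECT COUNT(*) AS n_rows FROM duckread")

# Peek at the first few rows
dbGetQuery(con_db, "SELECT * FROM duckread LIMIT 5")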

Extracting data from a table in the DB

# Method 1 : dbReadTable
duckread_1 <- dbReadTable(con_db, 'duckread')


# Method 2 : tbl (a lazy table reference)
duckread_1 <- tbl(con_db, 'duckread')


# Method 3 : view
dbExecute(con_db, 'CREATE VIEW trafficview AS SELECT Belts, Fatal, Alcohol, "Work Zone",
  "Personal Injury", "Property Damage", Race, Year FROM duckread')

dbListTables(con_db)

duckread_1 <- tbl(con_db, 'trafficview')
There are three ways to extract data from a table. dbReadTable returns a regular data frame, pulling everything into memory. tbl instead returns a lazy tbl that stays in the database until you collect it. You can use either depending on your preference, but to take full advantage of DuckDB it is best to use the third method.
dbExecute executes a SQL statement. A view is a virtual table, i.e., a stored query over the DB table (duckread). The query above creates a view ('trafficview') with the eight columns I want to use from the whole dataset. After creating the view, extract a lazy tbl from it ('trafficview') just as in Method 2.
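To see why the lazy tbl is the DuckDB-friendly route, consider this sketch (using columns from the view above): dplyr verbs are translated to SQL and executed inside DuckDB, and only the small aggregated result is brought into R with collect().

library(dplyr)

fatal_by_race <- tbl(con_db, 'trafficview') |>
  group_by(Race) |>
  summarise(n = n(),
            fatal = sum(ifelse(Fatal == "Yes", 1, 0), na.rm = TRUE))

show_query(fatal_by_race)    # the SQL that DuckDB will run
fatal_by_race |> collect()   # only the aggregated table comes back into R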

Data manipulation and Run Logit

# Create dummy variables for Fatal, Belts, Alcohol, Work.Zone, Personal.Injury, Property.Damage, Race

duckread_1 <- duckread_1 |> mutate(Fatal_1 = ifelse(Fatal == "Yes", 1, 0)) # repeat for the other variables

# recipes and glm need a local data frame, so collect the lazy tbl into R first
duckread_1 <- collect(duckread_1)

# Race -> dummies
library(recipes)
more_dummy_data <- recipe(. ~ Race, data = duckread_1) |>
  step_dummy(Race, one_hot = TRUE) |>
  prep() |>
  bake(new_data = duckread_1)

# Logit
model <- glm(Fatal_1 ~ Belts_1 + Alcohol_1 + Work.Zone_1 + Personal.Injury_1 + Property.Damage_1 +
               Year + Race_ASIAN + Race_BLACK + Race_HISPANIC + Race_NATIVE.AMERICAN + Race_OTHER,
             data = more_dummy_data, family = "binomial")
This dataset consists mostly of categorical variables, so the variables of interest are recoded into dummy variables before running the logit model.
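As a side note, rather than repeating the mutate call for each Yes/No variable, the recoding can be done in one step with dplyr::across(). This is just a sketch: depending on how the CSV headers were imported, the column names may contain spaces (backticked below) or dots, and the generated dummy names will follow whichever form your data uses.

# Recode several Yes/No columns to 0/1 in one step instead of one mutate per variable
duckread_1 <- duckread_1 |>
  mutate(across(c(Fatal, Belts, Alcohol, `Work Zone`, `Personal Injury`, `Property Damage`),
                ~ ifelse(.x == "Yes", 1, 0),
                .names = "{.col}_1"))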

Comparison of Elapsed Time and Used Memory

library(bench)
mark(
  # With DuckDB
  duckmodel <- glm(Fatal_1 ~ ...., data = more_dummy_data, family = "binomial"),

  # Without DuckDB (r_dummy_data here stands for the same variables prepared without DuckDB, e.g. via read_csv)
  rmodel <- glm(Fatal_1 ~ ...., data = r_dummy_data, family = "binomial"),

  check = FALSE  # the two fitted model objects differ, so skip bench::mark's equality check
)
    

# Memory comparison for each whole process.
library(pryr)
mem_used()
bench::mark compares the two logit models. With DuckDB, the elapsed time is 18.5 seconds and the allocated memory is 16.9 GB; without DuckDB, the elapsed time is 25.3 seconds and the allocated memory is 20.5 GB.
The memory used by each entire workflow is also compared with pryr::mem_used(). The whole process with DuckDB uses 1.62 GB, whereas the process without DuckDB requires 2.82 GB.
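For completeness, here is one way such a per-workflow memory check could look. This is a sketch rather than the exact procedure used above, and duckread_local is just an illustrative name.

library(pryr)

# Total memory currently used by this R session
mem_used()

# Additional memory allocated by a single step
mem_change(duckread_local <- dbReadTable(con_db, 'duckread'))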

Updating DB

# Update with dbWriteTable()
dbWriteTable(con_db, 'more_dummy_data', more_dummy_data, overwrite = TRUE)
dbListTables(con_db)


# Export Data from the new table 
dbReadTable(con_db, 'more_dummy_data')

or

tbl(con_db, 'more_dummy_data')

# Disconnecting
dbDisconnect(con_db, shutdown = TRUE)
dbWriteTable creates a new table in the database (con_db), saving the working data (more_dummy_data) as a new table ('more_dummy_data').
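Because the database lives in the r_duck.db file, the saved table remains available in a later R session. A minimal sketch of reconnecting:

# In a new R session: reconnect to the same file and pick up where you left off
library(duckdb)
library(dplyr)
con_db <- dbConnect(duckdb(), dbdir = "./r_duck.db")
dbListTables(con_db)                      # 'more_dummy_data' is still there
saved <- tbl(con_db, 'more_dummy_data')   # lazy reference; nothing is loaded yet
dbDisconnect(con_db, shutdown = TRUE)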

Saving Time, Memory, and Money

Databases are a practical answer to the memory constraints of processing large amounts of data, and DuckDB turns what used to be a formidable challenge into a straightforward, accessible workflow for social scientists. Although the dataset used in this post is not particularly large, even when your data outgrows your computer, DuckDB can help you postpone buying a new, high-performance machine.