Intro to GlueDb
Table of Contents
Intro
GlueDb is a database interface for accessing and managing heterogeneous collections of data artifacts, whether they’re built-in Python types, custom classes, or serialized objects.
I originally developed GlueDb to support my own analytics workflows, where keeping track of files, models, and configuration data quickly became unmanageable. But its utility extends far beyond analytics, offering a flexible system for any project that requires structured access to diverse data.
Accessing Data Artifacts
Let’s walk through an example using a GlueDb instance:
from pyswark.core.io import api
db = api.read( 'pyswark:/data/sma-example.gluedb' )
Note:
In the URI above, the pyswark:/ protocol is an alias to the pyswark-lib/pyswark path
When unpacked, the full URI points to pyswark-lib/pyswark/data/sma-example.gluedb
Access By Name
With the instance loaded, we can view the names tagged to each record.
Gluedb ensures that names in each instance are unique.
print( db.getNames() )
# ['JPM', 'BAC', 'kwargs']
We can get the record for a data artifact based on its name:
record = db.get( 'JPM' )
print( record.body.toJson() )
Which outputs:
{
"model": "pyswark.core.models.body.Body",
"contents": {
"model": "pyswark.gluedb.models.IoModel",
"contents": "{\"uri\": \"pyswark:/data/ohlc-jpm.csv.gz\", \"datahandler\": \"\", \"kw\": {}, \"datahandlerWrite\": \"\", \"kwWrite\": {}}"
}
}
We can then acquire the contents of the record:
record = db.get( 'JPM' )
contents = record.acquire()
print( type( contents ))
# <class 'pyswark.gluedb.models.IoModel'>
And from the contents, we can extract the final data artifact:
JPM = contents.extract()
print( JPM.head(2) )
Which outputs:
Open High Low Close Volume Ex-Dividend Split Ratio
Date
1983-12-30 44.00 44.50 43.50 44.0 47000.0 0.0 1.0
1984-01-03 43.94 44.25 43.62 44.0 85667.0 0.0 1.0
You can also extract the artifact in one call:
JPM = db.extract( "JPM" ) # via string
Enum = db.enum
JPM = db.extract( Enum.JPM ) # via enum
Access by Query
SQLModel expressions are supported in GlueDb:
from sqlmodel import Session, select
sqlDb = db.asSQLModel() # convert gluedb to sqlmodel
with Session( sqlDb.engine ) as session:
recordsBefore2026 = session.exec(
select( sqlDb.RECORD )
.join( sqlDb.INFO )
.where( sqlDb.INFO.date_created < '2026-01-01' )
).all()
recordsAfter2026 = session.exec(
select( sqlDb.RECORD )
.join( sqlDb.INFO )
.where( sqlDb.INFO.date_created >= '2026-01-01' )
).all()
print([ r.asModel().info.name for r in recordsBefore2026 ])
# ['JPM', 'BAC']
print([ r.asModel().info.name for r in recordsAfter2026 ])
# ['kwargs']
sqlDb.dispose() # dispose the sqlmodel engine to release the connection pool
Managing Data Artifacts
REST-like operations are used to manage the GlueDb instance:
postputdelete
For example, here’s how I used these operations to create and export the sma-example database:
from pyswark.gluedb import api
from pyswark.core.models import collection, primitive
db = api.newDb()
db.post( 'pyswark:/data/ohlc-jpm.csv.gz', name=Enum.JPM )
db.post( 'pyswark:/data/ohlc-bac.csv.gz', name='BAC' )
db.post( primitive.Int("60.0"), name='window' )
db.post( collection.Dict({ "window": 60 }), name='kwargs' )
db.delete( 'window' )
from pyswark.core.io.api import write
write( db, 'file:./sma-example.gluedb' )
Persisting with Db.connect
In addition to api.read and api.write, GlueDb supports a context-manager pattern via Db.connect(). When persist=True, any changes made inside the with block are automatically written back to the .gluedb file on successful exit:
from pyswark.gluedb.db import Db
from pyswark.core.models import collection
# Open an existing catalog with auto-save on exit
with Db.connect('file:./sma-example.gluedb', persist=True) as db:
db.post( collection.Dict({ "window": 120 }), name='kwargs_120' )
# On exit: writes to file:./sma-example.gluedb
# Verify the change persisted
db = Db.connect('file:./sma-example.gluedb')
print( db.getNames() )
# ['JPM', 'BAC', 'kwargs', 'kwargs_120']
This is especially useful for workflows where the catalog is both read and updated in place, without having to manually call api.write.
Use Cases
GlueDb was originally built to support analytics workflows by acting as a lightweight configuration layer for data artifacts, like in the following example:
from pyswark.gluedb import api
db = api.read( 'pyswark:/data/sma-example.gluedb' )
# extract the data
Enum = db.enum
JPM = db.extract( Enum.JPM )
BAC = db.extract( Enum.BAC )
kwargs = db.extract( Enum.kwargs )
# Calculate the simple moving average (SMA)
JPM_SMA = JPM.rolling( **kwargs ).mean()
BAC_SMA = BAC.rolling( **kwargs ).mean()

This pattern keeps the code clean, decoupled, and easily configurable. Swapping in new data or parameters doesn’t require digging through logic. It’s just a matter of updating the GlueDb instance.
Additional Use Cases
GlueDb’s flexibility makes it a good fit for a range of domains beyond analytics:
-
Data Migrations
Consolidate local files and objects into portable formats or move them seamlessly to cloud storage.
-
Machine Learning
Keep track of datasets, model versions, and parameters across training runs and experiments.
-
Reproducible Research
Version and reference datasets consistently to support transparency and replicability in published work.
Final Thoughts
GlueDb is ideal for developers, ML practitioners, and data wranglers who want a minimal but powerful system for storing, querying, and managing structured data artifacts.
Thanks for reading.