Intro to GlueDb
Intro
GlueDb is a database interface for accessing and managing heterogeneous collections of data artifacts, whether they’re built-in Python types, custom classes, or serialized objects.
I originally developed GlueDb to support my own analytics workflows, where keeping track of files, models, and configuration data quickly became unmanageable. But its utility extends far beyond analytics, offering a flexible system for any project that requires structured access to diverse data.
Accessing Data Artifacts
Let’s walk through an example using a GlueDb instance:
from pyswark.gluedb import api
db = api.connect( 'pyswark:/data/sma-example.gluedb' )
Note:
In the URI above, the pyswark:/ protocol is an alias for the pyswark-lib/pyswark path.
When expanded, the full URI points to pyswark-lib/pyswark/data/sma-example.gluedb.
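To make the alias expansion concrete, here is a minimal sketch of how a protocol prefix could be rewritten into a base path. The `ALIASES` table and the `expand_alias` function are hypothetical names used for illustration; this is not how pyswark resolves URIs internally, only the mapping described in the note.

```python
# Hypothetical alias table mirroring the note above.
# Neither ALIASES nor expand_alias is part of pyswark's API.
ALIASES = {"pyswark": "pyswark-lib/pyswark"}


def expand_alias(uri: str, aliases: dict = ALIASES) -> str:
    """Replace a leading '<alias>:/' with the base path it stands for."""
    scheme, sep, rest = uri.partition(":/")
    if sep and scheme in aliases:
        return f"{aliases[scheme]}/{rest.lstrip('/')}"
    return uri  # unknown scheme or plain path: leave untouched


print(expand_alias("pyswark:/data/sma-example.gluedb"))
# pyswark-lib/pyswark/data/sma-example.gluedb
```

URIs whose prefix is not in the table pass through unchanged in this sketch.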
Access By Name
With the instance loaded, we can view the names tagged to each record.
GlueDb ensures that names within an instance are unique.
print( db.getNames() )
# ['JPM', 'BAC', 'kwargs']
We can get the record for a data artifact based on its name:
record = db.get( 'JPM' )
print( record.body )
Which outputs:
{
    "model": "pyswark.gluedb.db.Contents",
    "contents": {
        "uri": "pyswark:/data/ohlc-jpm.csv.gz",
        "datahandler": "",
        "kw": {},
        "datahandlerWrite": "",
        "kwWrite": {}
    }
}
We can then acquire the contents of the record:
record = db.get( 'JPM' )
contents = record.acquire()
print( type( contents ))
# <class 'pyswark.gluedb.db.Contents'>
And from the contents, we can extract the final data artifact:
JPM = contents.extract()
print( JPM.head(2) )
Which outputs:
             Open   High    Low  Close   Volume  Ex-Dividend  Split Ratio
Date
1983-12-30  44.00  44.50  43.50   44.0  47000.0          0.0          1.0
1984-01-03  43.94  44.25  43.62   44.0  85667.0          0.0          1.0
You can also extract the artifact in one call:
JPM = db.extract( "JPM" ) # via string
Enum = db.enum
JPM = db.extract( Enum.JPM.value ) # via enum
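How GlueDb builds db.enum is not shown in this post. As an assumption, one plausible way to derive such an enum from the record names is Python's functional Enum API, sketched below with the names listed earlier. The `Names` class is a hypothetical stand-in, not GlueDb's own object.

```python
from enum import Enum

names = ["JPM", "BAC", "kwargs"]  # the names printed by db.getNames() above

# Functional Enum API: each member's name and value are the same string.
# "Names" is an illustrative stand-in for whatever db.enum really is.
Names = Enum("Names", {name: name for name in names})

print(Names.JPM.value)          # JPM
print([m.name for m in Names])  # ['JPM', 'BAC', 'kwargs']
```

The practical benefit of going through an enum is that a misspelled name such as Names.JPN fails immediately with an AttributeError, and editors can autocomplete the valid names.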
Access by Query
SQLAlchemy expressions are supported in GlueDb:
from sqlalchemy import select
from pyswark.gluedb import table
recordsBefore2025 = db.getByQuery( select( table.Info ).where(
    table.Info.date_created < '2025-01-01'
))
recordsAfter2025 = db.getByQuery( select( table.Info ).where(
    table.Info.date_created > '2025-01-01'
))
print([ r.info.name for r in recordsBefore2025 ])
# ['JPM', 'BAC']
print([ r.info.name for r in recordsAfter2025 ])
# ['kwargs']
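For readers less familiar with SQLAlchemy, the two filters above correspond to ordinary SQL WHERE clauses. The sketch below reproduces them with the standard-library sqlite3 module against a hypothetical `info` table. The table layout and the creation dates are made up for illustration and are not taken from GlueDb's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the Info table; real columns may differ.
conn.execute("CREATE TABLE info (name TEXT PRIMARY KEY, date_created TEXT)")
# Illustrative dates only; the real creation dates are not given in this post.
conn.executemany(
    "INSERT INTO info VALUES (?, ?)",
    [("JPM", "2024-11-02"), ("BAC", "2024-11-02"), ("kwargs", "2025-03-15")],
)

cutoff = ("2025-01-01",)
# ISO-8601 date strings sort chronologically, so text comparison works here.
before = [row[0] for row in conn.execute(
    "SELECT name FROM info WHERE date_created < ? ORDER BY rowid", cutoff)]
after = [row[0] for row in conn.execute(
    "SELECT name FROM info WHERE date_created > ? ORDER BY rowid", cutoff)]

print(before)  # ['JPM', 'BAC']
print(after)   # ['kwargs']
```

Note that with strict `<` and `>` comparisons, a record created exactly on the cutoff date would match neither query; use `>=` on one side if the two result sets need to cover every record.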
Managing Data Artifacts
REST-like operations are used to manage the GlueDb instance:
post
put
delete
For example, here’s how I used these operations to create and export the sma-example database:
from pyswark.gluedb import api
from pyswark.core.models import collection, primitive
db = api.newDb()
db.post( 'JPM', 'pyswark:/data/ohlc-jpm.csv.gz' )
db.post( 'BAC', 'pyswark:/data/ohlc-bac.csv.gz' )
db.post( 'window', primitive.Int("60.0") )
db.post( 'kwargs', collection.Dict({ "window": 60 }))
db.delete( 'window' )
from pyswark.core.io.api import write
write( db, 'file:./sma-example.gluedb' )
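To illustrate the REST-like vocabulary, here is a toy in-memory registry with the same three verbs. The semantics are assumptions borrowed from HTTP conventions (post creates and rejects duplicates, put creates or replaces, delete removes); GlueDb's actual behavior may differ, and `TinyRegistry` is a hypothetical class, not part of pyswark.

```python
class TinyRegistry:
    """Toy name -> value store with assumed semantics, not GlueDb's implementation."""

    def __init__(self):
        self._records = {}

    def post(self, name, value):
        # Assumed: posting under an existing name is an error, so names stay unique.
        if name in self._records:
            raise KeyError(f"name already exists: {name!r}")
        self._records[name] = value

    def put(self, name, value):
        # Assumed: put creates the record or replaces an existing one.
        self._records[name] = value

    def delete(self, name):
        # Raises KeyError if the name is unknown.
        del self._records[name]

    def getNames(self):
        return list(self._records)


reg = TinyRegistry()
reg.post("window", 60)
reg.post("kwargs", {"window": 60})
reg.put("kwargs", {"window": 20})  # replace the existing record
reg.delete("window")
print(reg.getNames())  # ['kwargs']
```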
Use Cases
GlueDb was originally built to support analytics workflows by acting as a lightweight configuration layer for data artifacts, like in the following example:
from pyswark.gluedb import api
db = api.connect( 'pyswark:/data/sma-example.gluedb' )
# extract the data
Enum = db.enum
JPM = db.extract( Enum.JPM.value )
BAC = db.extract( Enum.BAC.value )
kwargs = db.extract( Enum.kwargs.value )
# Calculate the simple moving average (SMA)
JPM_SMA = JPM.rolling( **kwargs ).mean()
BAC_SMA = BAC.rolling( **kwargs ).mean()
This pattern keeps the code clean, decoupled, and easily configurable. Swapping in new data or parameters doesn’t require digging through logic. It’s just a matter of updating the GlueDb instance.
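For reference, the rolling-mean step in the example can be spelled out in plain Python. This is a minimal sketch of a trailing simple moving average, not the pandas implementation. The first two prices come from the JPM output shown earlier and the rest are made up for illustration.

```python
def rolling_mean(values, window):
    """Trailing mean over `window` points; None until the window is full."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        out.append(running / window if i >= window - 1 else None)
    return out


closes = [44.0, 44.0, 45.0, 47.0]  # last two values are illustrative
print(rolling_mean(closes, window=2))  # [None, 44.0, 44.5, 46.0]
```

With an integer window and default settings, pandas reports NaN rather than None for the first window - 1 rows, so with window=60 the first 59 SMA values are empty.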
Additional Use Cases
GlueDb’s flexibility makes it a good fit for a range of domains beyond analytics:
- Data Migrations: Consolidate local files and objects into portable formats or move them seamlessly to cloud storage.
- Machine Learning: Keep track of datasets, model versions, and parameters across training runs and experiments.
- Reproducible Research: Version and reference datasets consistently to support transparency and replicability in published work.
Final Thoughts
GlueDb is ideal for developers, ML practitioners, and data wranglers who want a minimal but powerful system for storing, querying, and managing structured data artifacts.
Thanks for reading.