Intro to GlueDb

Accessing Data Artifacts
Managing Data Artifacts
Use Cases
Final Thoughts

Intro

GlueDb is a database interface for accessing and managing heterogeneous collections of data artifacts, whether they’re built-in Python types, custom classes, or serialized objects.

I originally developed GlueDb to support my own analytics workflows, where keeping track of files, models, and configuration data quickly became unmanageable. But its utility extends far beyond analytics, offering a flexible system for any project that requires structured access to diverse data.

Accessing Data Artifacts

Let’s walk through an example using a GlueDb instance:

from pyswark.gluedb import api

db = api.connect( 'pyswark:/data/sma-example.gluedb' )

Note:
In the URI above, the pyswark:/ protocol is an alias to the pyswark-lib/pyswark path
When unpacked, the full URI points to pyswark-lib/pyswark/data/sma-example.gluedb

Access By Name

With the instance loaded, we can view the names tagged to each record.

Gluedb ensures that names in each instance are unique.

print( db.getNames() )
# ['JPM', 'BAC', 'kwargs']

We can get the record for a data artifact based on its name:

record = db.get( 'JPM' )
print( record.body )

Which outputs:

{
  "model": "pyswark.gluedb.db.Contents",
  "contents": {
    "uri": "pyswark:/data/ohlc-jpm.csv.gz",
    "datahandler": "",
    "kw": {},
    "datahandlerWrite": "",
    "kwWrite": {}
  }
}

We can then acquire the contents of the record:

record   = db.get( 'JPM' )
contents = record.acquire()
print( type( contents ))
# <class 'pyswark.gluedb.db.Contents'>

And from the contents, we can extract the final data artifact:

JPM = contents.extract()
print( JPM.head(2) )

Which outputs:

             Open   High    Low  Close   Volume  Ex-Dividend  Split Ratio 
Date                                                                        
1983-12-30  44.00  44.50  43.50   44.0  47000.0          0.0          1.0   
1984-01-03  43.94  44.25  43.62   44.0  85667.0          0.0          1.0   

You can also extract the artifact in one call:

JPM = db.extract( "JPM" ) # via string

Enum = db.enum
JPM  = db.extract( Enum.JPM.value ) # via enum

Access by Query

SQLAlchemy expressions are supported in GlueDb:

from sqlalchemy import select
from pyswark.gluedb import table

recordsBefore2025 = db.getByQuery( select( table.Info ).where( 
    table.Info.date_created < '2025-01-01' 
))
recordsAfter2025 = db.getByQuery( select( table.Info ).where( 
    table.Info.date_created > '2025-01-01' 
))

print([ r.info.name for r in recordsBefore2025 ])
# ['JPM', 'BAC']

print([ r.info.name for r in recordsAfter2025 ])
# ['kwargs']

Managing Data Artifacts

REST-like operations are used to manage the GlueDb instance:

post
put
delete

For example, here’s how I used these operations to create and export the sma-example database:

from pyswark.gluedb import api
from pyswark.core.models import collection, primitive

db = api.newDb()
db.post( 'JPM', 'pyswark:/data/ohlc-jpm.csv.gz' )
db.post( 'BAC', 'pyswark:/data/ohlc-bac.csv.gz' )
db.post( 'window', primitive.Int("60.0") )
db.post( 'kwargs', collection.Dict({ "window": 60 }))
db.delete( 'window' )

from pyswark.core.io.api import write

write( db, 'file:./sma-example.gluedb' )

Use Cases

GlueDb was originally built to support analytics workflows by acting as a lightweight configuration layer for data artifacts, like in the following example:

from pyswark.gluedb import api

db = api.connect( 'pyswark:/data/sma-example.gluedb' )

# extract the data
Enum   = db.enum
JPM    = db.extract( Enum.JPM.value )
BAC    = db.extract( Enum.BAC.value )
kwargs = db.extract( Enum.kwargs.value )

# Calculate the simple moving average (SMA)
JPM_SMA = JPM.rolling( **kwargs ).mean()
BAC_SMA = BAC.rolling( **kwargs ).mean()

sma-plot

This pattern keeps the code clean, decoupled, and easily configurable. Swapping in new data or parameters doesn’t require digging through logic. It’s just a matter of updating the GlueDb instance.

Additional Use Cases

GlueDb’s flexibility makes it a good fit for a range of domains beyond analytics:

Data Migrations

Consolidate local files and objects into portable formats or move them seamlessly to cloud storage.
Machine Learning

Keep track of datasets, model versions, and parameters across training runs and experiments.
Reproducible Research

Version and reference datasets consistently to support transparency and replicability in published work.

Final Thoughts

GlueDb is ideal for developers, ML practitioners, and data wranglers who want a minimal but powerful system for storing, querying, and managing structured data artifacts.

Thanks for reading.

Table of Contents