Accessing an Access Database with Pandas

Fri 26 May 2023

Filed under Data Engineering

Tags pandas access

Interfacing Pandas to an Access Database

Introduction

This post is intended to be a guide for Python users, who wish to process a Microsoft (MS) Access database. The assumption is that pandas will be the primary analysis tool.

One option is to read the Access datatables into a MS Excel spreadsheet. This allows for an initial exploration, and access to Excel spreadsheets is well supported by pandas, but the two step translation loses the meta-data (column types, etc), so direct interfacing was chosen for implementation.

Background

Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The extensive support for ODBC in the Python ecosystem makes it the obvious choice for interfacing to MS Access.

As shown below, the appropriate drivers from MS must be installed to support the interfacing operations.

pyodbc Package

The pyodbc package was the first choice for an implementation of code to read the databasde

Initial Implementation

Imports

In this section, we import the Python packages we will use.

In [21]:
import pandas as pd
import pyodbc

# used for environment documentation
import sys
import os
import platform
import datetime
In [22]:
%load_ext watermark
The watermark extension is already loaded. To reload it, use:
  %reload_ext watermark
In [23]:
%load_ext lab_black
The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black

Driver environment

As discussed above, the appropriate software drivers from MS must be installed to support the intergace to Access databases. We can determine the drivers available by the following code snippet, which create a list of the drivers available (as seen by the pyodbc package).

In [24]:
[x for x in pyodbc.drivers()]
Out[24]:
['SQL Server',
 'SQL Server Native Client 11.0',
 'Microsoft Access Driver (*.mdb, *.accdb)',
 'Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)',
 'Microsoft Access dBASE Driver (*.dbf, *.ndx, *.mdx)',
 'Microsoft Access Text Driver (*.txt, *.csv)']

In the initial run of this snippet, the only drivers available were the SQL Server drivers. Going to the link

https://www.microsoft.com/en-us/download/details.aspx?id=54920

and clicking the "Download" button will install the Microsoft Access Driver, as shown above.

Testing the Interface

To test the interface, we create a connection string that specifies the interfacing driver we need, and the database location (in this case, local to the machine running Python). Then we create a connection.

In [25]:
connStr = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=D:\WaterQualityDB2022\wq2022.accdb;"
)
cnxn = pyodbc.connect(connStr)

We know from the exploration via Excel, that there is a table Equipment. We attempt to read the table into pandas.

In [26]:
sql_qry = 'select * from Equipment'
df = pd.read_sql(sql_qry, cnxn)
df.head()
D:\Anaconda2022\lib\site-packages\pandas\io\sql.py:762: UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy
  warnings.warn(
Out[26]:
Equipment ID Equipment Name Serial Probe Serial Model
0 1 Maroochy 1 Horiba U52
1 2 Maroochy 2 Horiba U52
2 3 Maroochy 3 Horiba U52
3 4 Maroochy 4 Horiba U52
4 5 Maroochy 5 Horiba U52

The problem here is that although this works, we have a warning message from pandas that the ODC access we are using is not supported:

UserWarning: pandas only support SQLAlchemy connectable(engine/connection) ordatabase string URI or sqlite3 DBAPI2 connectionother DBAPI2 objects are not tested, please consider using SQLAlchemy

Now in general, application or package warning messages have to considered seriously, especially in software intended to have an extended lifetime, because subsequent updates to pandas could totally remove support for non-SQLAlchemy access methods.

SQLAlchemy Implementation

Investigation of integration of the SQLAlchemy package, pandas, and MS Access revealed one potential item of concern: MS Access is not a supported database that the SQLAlchemy project supports. I suspect that this is because Python (especially in the machine learning domain) is oriented towards large to very large databases, and not the relatively small databases that MS Access is typically used for.

However, there is an independent Python package sqlalchemy_access, that provides the interface between SQLAlchemy and MS Access. Note that we still have to import the pyodbc package. sqlalchemy_access did not appear to be in the Anaconda ecosystem, so I just downloaded it from GitHub GitHub link

Implementation

Early SQLAlchemy

Again, we define the packages we wish to import.

In [27]:
import pandas as pd
import sqlalchemy as sa
import sqlalchemy_access

import pyodbc

Next (as before), we define a connection specification, and establish a connection to the database.

In [28]:
import sqlalchemy_access

connection_string = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=D:\WaterQualityDB2022\wq2022.accdb;"
)
connection_url = sa.engine.URL.create(
    "access+pyodbc",
    query={"odbc_connect": connection_string},
)
engine = sa.create_engine(connection_url)

Now we can retrieve the Equipment table, and display the contents, without generating warning messages.

In [29]:
qry = 'select * from Equipment;'
df = pd.read_sql_query(qry, engine)
df.head(1).T
Out[29]:
0
Equipment ID 1
Equipment Name Maroochy 1
Serial
Probe Serial
Model Horiba U52

Later SQLAlchemy

There is a problem: at sqlalchemy 2.0.11, the code above gives :

AttributeError: 'OptionEngine' object has no attribute 'execute'

Refer to https://levelup.gitconnected.com/how-to-fix-attributeerror-optionengine-object-has-no-attribute-execute-in-pandas-eb635fbb89e4

The suggested fix is

import pandas as pd
from sqlalchemy import create_engine, text


engine = create_engine(connection_url)
query = 'SELECT * FROM my_table'

df = pd.read_sql_query(sql=text(query), con=engine.connect())

This fix also works fine at sqlalchemy 1.4.39

In [30]:
df = pd.read_sql_query(
    sql=sa.text(qry),
    con=engine.connect(),
)
df.head(1).T
Out[30]:
0
Equipment ID 1
Equipment Name Maroochy 1
Serial
Probe Serial
Model Horiba U52

Database Structure

We can confirm that SQLAlchemy 'sees' the same database tables, as can be seen via Excel.

In [31]:
from sqlalchemy import inspect

insp = inspect(engine)
insp.get_table_names()
Out[31]:
['Equipment',
 'Maintenance',
 'Parameter Standards',
 'People',
 'Results_Kit_Calibration',
 'Results_Sites_Water_Testing',
 'Sites']
In [ ]:
 

Reproducibility

This sections contains information to support the reproducibility of the software.

In [32]:
import sys
import os
import platform
import datetime
In [33]:
# | warn: false
%reload_ext watermark
In [34]:
# | warn: false
%reload_ext lab_black

Python execution environment

In [35]:
theNotebook = 'InterfacingToWQDatabase.ipynb'

print(f'Notebook: {theNotebook}')
print('python version : ' + sys.version)
print('pandas version : ' + pd.__version__)
print('current wkg dir: ' + os.getcwd())
print(
    'Notebook run at: '
    + str(datetime.datetime.now())
    + ' local time'
)
print(
    'Notebook run at: '
    + str(datetime.datetime.utcnow())
    + ' UTC'
)
print('Notebook run on: ' + platform.platform())
Notebook: InterfacingToWQDatabase.ipynb
python version : 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]
pandas version : 1.4.4
current wkg dir: C:\Users\donrc\Documents\JupyterNotebooks\PythonNotebookProject\develop
Notebook run at: 2023-05-24 20:30:00.363071 local time
Notebook run at: 2023-05-24 10:30:00.364066 UTC
Notebook run on: Windows-10-10.0.19045-SP0

watermark data

Execution environment as reported by the watermark utility

In [36]:
%watermark
Last updated: 2023-05-24T20:30:00.391079+10:00

Python implementation: CPython
Python version       : 3.9.13
IPython version      : 7.31.1

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
CPU cores   : 8
Architecture: 64bit

In [37]:
%watermark -h -iv -co
conda environment: D:\Anaconda2022

Hostname: DESKTOP-SODFUN6

sys              : 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)]
pandas           : 1.4.4
pyodbc           : 4.0.34
sqlalchemy_access: 1.1.4.dev0
sqlalchemy       : 1.4.39
platform         : 1.0.8

In [ ]:
 

Comments


net-analysis.com Data Analysis Blog © Don Cameron Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More