Harnessing Apache Arrow for High-Speed Data Transfer from SQL Server to Python with mssql-python


Overview

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, a million allocations for the garbage collector to track, only to throw them all away when building the DataFrame. That inefficient workflow is now a thing of the past. The mssql-python driver has introduced support for retrieving SQL Server data directly as Apache Arrow structures. This provides a faster and more memory-efficient path for anyone working with SQL Server data in Polars, Pandas, DuckDB, or any other Arrow-native library. The feature was contributed by community developer Felix Graßl (@ffelixg), and it marks a significant step forward in data interoperability.

Source: devblogs.microsoft.com

Before diving into the technical details, let's clarify some key terms:

  • API (Application Programming Interface): a source-code contract that defines how to call a function or library.
  • ABI (Application Binary Interface): a binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly — no serialization is needed.
  • Arrow C Data Interface: Apache Arrow's ABI specification — the standard that makes zero-copy data exchange between languages possible.

Prerequisites

Before you begin, ensure you have the following:

  • Python 3.8 or higher installed on your system.
  • A SQL Server instance (local or remote) that you can connect to.
  • The mssql-python driver (version that includes Arrow support; check latest release).
  • One or more Arrow-native libraries, such as pyarrow, polars, pandas (with ArrowDtype), or duckdb.

Install the necessary packages using pip:

pip install mssql-python pyarrow polars pandas duckdb

Step-by-Step Instructions

1. Connecting to SQL Server

Start by establishing a connection to your SQL Server instance. The mssql-python driver uses a familiar connection string format:

import mssql_python

conn = mssql_python.connect(
    "Server=localhost;"
    "Database=your_database;"
    "Uid=your_username;"
    "Pwd=your_password;"
)

If you're using integrated authentication (Windows), you can omit the username and password parameters. Ensure the server and database names are correct.

2. Fetching Data as Apache Arrow Structures

The key improvement in mssql-python is the ability to fetch query results directly into Arrow record batches. Use the cursor.execute() method as usual, but then call fetch_arrow() instead of fetchall():

cursor = conn.cursor()
cursor.execute("SELECT * FROM your_table")
batches = cursor.fetch_arrow()  # Returns a list of pyarrow.RecordBatch

The fetch_arrow() method returns a sequence of Arrow record batches. Each batch holds a contiguous chunk of columnar data. You can combine them into a single Arrow table using pyarrow.Table.from_batches():

import pyarrow as pa

arrow_table = pa.Table.from_batches(batches)

Now you have the entire result set in Arrow format, with zero Python object creation per row during fetching.

3. Using Arrow Data with Polars

Polars can consume Arrow data directly without any conversion overhead. Pass the Arrow table to Polars:

import polars as pl

df_polars = pl.from_arrow(arrow_table)

Alternatively, you can feed the record batches directly to Polars' from_arrow() method. Polars will use the underlying Arrow buffers, avoiding any additional copying. This is the recommended approach for high-performance pipelines.

4. Using Arrow Data with Pandas

Pandas supports Arrow-backed columns via the ArrowDtype. To create a Pandas DataFrame from an Arrow table, use:

import pandas as pd

df_pandas = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)

This ensures the underlying data is stored as Arrow arrays, not as Python objects. The performance benefits are most noticeable with large datasets and string columns.

5. Using Arrow Data with DuckDB

DuckDB has first-class Arrow support. You can register an Arrow table as a virtual table and query it directly:

import duckdb

con = duckdb.connect()
con.register('my_data', arrow_table)
result = con.execute("SELECT * FROM my_data WHERE column > 100").fetch_arrow_table()
print(result)

DuckDB can also read Arrow from memory without copies, making it an excellent companion for analytical workloads.

6. Advanced: Working with C Data Interface

For maximum performance, you can bypass Python-level wrappers and use the Arrow C Data Interface directly. This is useful when passing data between libraries without any intermediate representation. mssql-python exposes the underlying C pointer arrays; see the library documentation for advanced use cases. Most users will be fine with the high-level fetch_arrow() method.

Common Mistakes

  • Forgetting to install pyarrow: The fetch_arrow() method depends on pyarrow. Without it, you'll get an import error. Install pyarrow separately, or via mssql-python[arrow] if the driver provides an extras option.
  • Using fetchall() instead of fetch_arrow(): The old method returns Python objects. To leverage Arrow, you must explicitly call fetch_arrow().
  • Version mismatches: Ensure your mssql-python version is recent enough to include the Arrow feature. Check the changelog or GitHub releases.
  • Ignoring data type limitations: Not all SQL Server types map perfectly to Arrow types. Complex types like XML may fall back to string representation. Test your queries.
  • Oversized batches: The default batch size may be too large for limited memory. Use cursor.arraysize to control the number of rows per batch.

Summary

Apache Arrow support in mssql-python revolutionizes how Python applications transfer data from SQL Server. By fetching data as columnar Arrow records directly from the driver, you avoid the traditional overhead of constructing Python objects row by row. This results in faster loading, lower memory consumption, and seamless interoperability with modern DataFrame libraries like Polars, Pandas (with ArrowDtype), and DuckDB. The prerequisites are minimal: a recent Python, the updated driver, and an Arrow-native library. Follow the step-by-step instructions above to start benefiting today.
