Skip to content

Support querying files directly in posit connect with duckdb #197

@machow

Description

@machow

Since pins-python uses fsspec under the hood, users are able to query pins data directly using duckdb's fsspec integration.

While #193 allows duckdb to query CSV pins on posit connect, parquet files cannot be queried. This is likely because duckdb needs to scan parquet headers.

Below, I provide examples, but first--here is a snippet to enable logging to stdout:

import logging
import sys

root = logging.getLogger("pins")
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
root.addHandler(handler)

Querying parquet pins on s3 (for reference)

First, here is how you connect to a temporary s3 board, and return info on a file:

# create a temporary board, with the contents of the pins-compat test board ----
# note that my s3 credentials are in a .env file, see .env.dev
from dotenv import load_dotenv
load_dotenv()

bb = BoardBuilder("s3")
board = bb.create_tmp_board("pins/tests/pins-compat")

# display info for a csv file ----
board.fs.info(f"s3://{board.board}/df_csv/20220214T163718Z-eceac/df_csv.csv")
{'ETag': '"e6e2bc89538baa1ee31e3294efcb1d82"',
 'LastModified': datetime.datetime(2023, 4, 12, 15, 22, 41, tzinfo=tzutc()),
 'size': 20,
 'name': 'ci-pins/222afb60-8e19-4cc0-b1a5-80098d0f410d/df_csv/20220214T163718Z-eceac/df_csv.csv',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'text/csv'}

Next, we'll add a parquet pin

from pins.data import mtcars

board.pin_write(mtcars, "df_parquet", type="parquet")
board.pin_versions("df_parquet")
              created   hash                 version
0 2023-04-12 11:30:57  69d97  20230412T113057Z-69d97

Finally, we'll query directly in duckdb

import duckdb

duckdb.register_filesystem(board.fs)

# query via duckdb! ----
data_path = f"s3://{board.board}/df_parquet/20230412T113057Z-69d97/df_parquet.parquet"
duckdb.execute(f"SELECT mpg FROM read_parquet('{data_path}')").df()

Querying parquet in pins

import pins
import duckdb

# note that my connect credentials are in a .env file
from dotenv import load_dotenv
load_dotenv()

# connect to board, register fs to duckdb ----
board = pins.board_connect("https://colorado.posit.co/rsc")
duckdb.register_filesystem(board.fs)

# look up bundle id ----
board.pin_meta("michael.chow/mtcars3")

# query with duckdb ----
duckdb.execute(
     "SELECT * FROM read_parquet('rsc://michael.chow/mtcars3/72103/mtcars3.parquet')"
).df()
InvalidInputException: Invalid Input Error: No magic bytes found at end of file 'rsc://michael.chow/mtcars3/72103/mtcars3.parquet'

I think the issue has to do with how we're returning info on the file.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions