-
Notifications
You must be signed in to change notification settings - Fork 12
Open
Labels
Description
Since pins-python uses fsspec under the hood, users are able to query pins data directly using duckdb's fsspec integration.
While #193 allows duckdb to query CSV pins on posit connect, parquet files cannot be queried. This is likely because duckdb needs to scan parquet headers.
Below, I provide examples, but first--here is a snippet to enable logging to stdout:
import logging
import sys
root = logging.getLogger("pins")
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
handler.setFormatter(formatter)
root.addHandler(handler)Querying parquet pins on s3 (for reference)
First, here is how you connect to a temporary s3 board, and return info on a file:
# create a temporary board, with the contents of the pins-compat test board ----
# note that my s3 credentials are in a .env file, see .env.dev
from dotenv import load_dotenv
load_dotenv()
bb = BoardBuilder("s3")
board = bb.create_tmp_board("pins/tests/pins-compat")
# display info for a csv file ----
board.fs.info(f"s3://{board.board}/df_csv/20220214T163718Z-eceac/df_csv.csv"){'ETag': '"e6e2bc89538baa1ee31e3294efcb1d82"',
'LastModified': datetime.datetime(2023, 4, 12, 15, 22, 41, tzinfo=tzutc()),
'size': 20,
'name': 'ci-pins/222afb60-8e19-4cc0-b1a5-80098d0f410d/df_csv/20220214T163718Z-eceac/df_csv.csv',
'type': 'file',
'StorageClass': 'STANDARD',
'VersionId': None,
'ContentType': 'text/csv'}Next, we'll add a parquet pin
from pins.data import mtcars
board.pin_write(mtcars, "df_parquet", type="parquet")
board.pin_versions("df_parquet") created hash version
0 2023-04-12 11:30:57 69d97 20230412T113057Z-69d97
Finally, we'll query directly in duckdb
import duckdb
duckdb.register_filesystem(board.fs)
# query via duckdb! ----
data_path = f"s3://{board.board}/df_parquet/20230412T113057Z-69d97/df_parquet.parquet"
duckdb.execute(f"SELECT mpg FROM read_parquet('{data_path}')").df()Querying parquet in pins
import pins
import duckdb
# note that my connect credentials are in a .env file
from dotenv import load_dotenv
load_dotenv()
# connect to board, register fs to duckdb ----
board = pins.board_connect("https://colorado.posit.co/rsc")
duckdb.register_filesystem(board.fs)
# look up bundle id ----
board.pin_meta("michael.chow/mtcars3")
# query with duckdb ----
duckdb.execute(
"SELECT * FROM read_parquet('rsc://michael.chow/mtcars3/72103/mtcars3.parquet')"
).df()InvalidInputException: Invalid Input Error: No magic bytes found at end of file 'rsc://michael.chow/mtcars3/72103/mtcars3.parquet'I think the issue has to do with how we're returning info on the file.
SamEdwardes