Zerobus - File Mode

A lightweight, no‑code file ingestion workflow. Configure a set of tables, get a volume path for each, and drop files into those paths—your data lands in Unity Catalog tables via Auto Loader and Lakeflow Pipeline.

Table of Contents

- Quick Start
- Debug Table Issues

Quick Start

Step 1. Configure tables

Edit table configs in ./src/configs/tables.json. Only name and format are required.

Currently supported formats are csv, json, avro and parquet.

For supported format_options, see the Auto Loader options. Not all options are supported here. If unsure, specify only name and format, or follow Debug Table Issues to discover the correct options.

[
  {
    "name": "table1",
    "format": "csv",
    "format_options":
    {
      "escape": "\""
    },
    "schema_hints": "id int, name string"
  },
  {
    "name": "table2",
    "format": "json"
  }
]

Tip: Keep schema_hints minimal; Auto Loader can evolve the schema as new columns appear.
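
For context, each entry in tables.json corresponds roughly to one Auto Loader stream in the generated pipeline. The sketch below is illustrative only (the pipeline code is generated for you, so the exact wiring is an assumption); it shows how format, format_options, and schema_hints map onto standard Auto Loader options:

# Illustrative sketch: how a tables.json entry plausibly maps onto an
# Auto Loader read. The real pipeline code is generated by the bundle;
# the SparkSession and load path are shown only for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

table_cfg = {
    "name": "table1",
    "format": "csv",
    "format_options": {"escape": "\""},
    "schema_hints": "id int, name string",
}

df = (
    spark.readStream.format("cloudFiles")
    # "format" -> cloudFiles.format (csv, json, avro, or parquet)
    .option("cloudFiles.format", table_cfg["format"])
    # "schema_hints" -> cloudFiles.schemaHints
    .option("cloudFiles.schemaHints", table_cfg["schema_hints"])
    # "format_options" -> plain reader options, e.g. escape='"'
    .options(**table_cfg.get("format_options", {}))
    # files arrive in the table's volume path (see Step 3)
    .load("/Volumes/chi_cata/filepushschema/chi_cata_filepushschema_filepush_volume/data/table1")
)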

Step 2. Deploy & set up

databricks bundle deploy
databricks bundle run configuration_job

Wait for the configuration job to finish before moving on.

Step 3. Retrieve endpoint & push files

First, open the volume in the workspace and grant write permission to whichever principal will push files:

databricks bundle open filepush_volume
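
The command above opens the volume page, where you can grant permissions through the UI. To grant access programmatically instead, one option is the Unity Catalog grants API in the Databricks Python SDK; a hedged sketch (assumes databricks-sdk is installed and authentication is configured; the principal is a placeholder):

# Sketch: grant read/write on the push volume with the Python SDK.
# The principal below is a placeholder, not something this project defines.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import catalog

w = WorkspaceClient()

w.grants.update(
    securable_type=catalog.SecurableType.VOLUME,
    full_name="chi_cata.filepushschema.chi_cata_filepushschema_filepush_volume",
    changes=[
        catalog.PermissionsChange(
            principal="file-pusher@example.com",  # placeholder principal
            add=[catalog.Privilege.READ_VOLUME, catalog.Privilege.WRITE_VOLUME],
        )
    ],
)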

Fetch the volume path for uploading files to a specific table (example: table1):

databricks tables get chi_cata.filepushschema.table1 --output json \
  | jq -r '.properties["filepush.table_volume_path_data"]'

Example output:

/Volumes/chi_cata/filepushschema/chi_cata_filepushschema_filepush_volume/data/table1
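
Equivalently, the same table property can be read with the Databricks Python SDK (a sketch, assuming databricks-sdk is installed and authentication is configured):

# Sketch: read the volume path from the table property with the Python SDK.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()
table = w.tables.get("chi_cata.filepushschema.table1")
print(table.properties["filepush.table_volume_path_data"])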

Upload files to the path above using any of the Volumes file APIs.

Databricks CLI example (destination uses the dbfs: scheme):

databricks fs cp /local/file/path/datafile1.csv \
  dbfs:/Volumes/chi_cata/filepushschema/chi_cata_filepushschema_filepush_volume/data/table1

REST API example:

# prerequisites: export DATABRICKS_HOST and DATABRICKS_TOKEN (a personal access token)
curl -X PUT "$DATABRICKS_HOST/api/2.0/fs/files/Volumes/chi_cata/filepushschema/chi_cata_filepushschema_filepush_volume/data/table1/datafile1.csv" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @"/local/file/path/datafile1.csv"
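
Python SDK example (a sketch of the same Files API call, assuming databricks-sdk is installed and authentication is configured):

# Sketch: upload a file into the table's volume path with the Python SDK.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

volume_path = (
    "/Volumes/chi_cata/filepushschema/"
    "chi_cata_filepushschema_filepush_volume/data/table1"
)

with open("/local/file/path/datafile1.csv", "rb") as f:
    # Same PUT as the curl example above, via the Files API client.
    w.files.upload(f"{volume_path}/datafile1.csv", f, overwrite=True)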

Within about a minute, the data should appear in the table chi_cata.filepushschema.table1.
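
A quick way to verify is to query the table, for example from a notebook in the workspace (a minimal check; spark is predefined in Databricks notebooks):

# Sketch: confirm the pushed rows have landed.
spark.table("chi_cata.filepushschema.table1").show(10)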


Debug Table Issues

If data isn’t parsed as expected, use dev mode to iterate on table options safely.

Step 1. Configure tables to debug

Configure tables as in Step 1 of Quick Start.

Step 2. Deploy & set up in dev mode

databricks bundle deploy -t dev
databricks bundle run configuration_job -t dev

Wait for the configuration job to finish. Example output:

2025-09-23 22:03:04,938 [INFO] initialization - ==========
catalog_name: chi_cata
schema_name: dev_first_last_filepushschema
volume_path_root: /Volumes/chi_cata/dev_first_last_filepushschema/chi_cata_filepushschema_filepush_volume
volume_path_data: /Volumes/chi_cata/dev_first_last_filepushschema/chi_cata_filepushschema_filepush_volume/data
volume_path_archive: /Volumes/chi_cata/dev_first_last_filepushschema/chi_cata_filepushschema_filepush_volume/archive
==========

Note: In dev mode, the schema name is prefixed. Use the printed schema name for the remaining steps.

Step 3. Retrieve endpoint & push files to debug

Get the dev volume path (note the prefixed schema):

databricks tables get chi_cata.dev_first_last_filepushschema.table1 --output json \
  | jq -r '.properties["filepush.table_volume_path_data"]'

Example output:

/Volumes/chi_cata/dev_first_last_filepushschema/chi_cata_filepushschema_filepush_volume/data/table1

Then follow the upload instructions from Quick Start → Step 3 to send test files.

Step 4. Debug table configs

Open the pipeline in the workspace:

databricks bundle open refresh_pipeline -t dev

Click Edit pipeline to launch the development UI. Open the debug_table_config notebook and follow its guidance to refine the table options. When satisfied, copy the final config back to ./src/configs/tables.json.
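
Independently of the notebook, you can also iterate on reader options by hand: read a sample of the files already pushed to the dev volume path with candidate options and inspect the result. A sketch of that kind of iteration (option and schema values are examples to try, not recommendations; spark is predefined in notebooks):

# Sketch: try candidate reader options against files in the dev volume path,
# then copy the winning values back into ./src/configs/tables.json.
candidate_options = {"escape": "\"", "header": "true"}

df = (
    spark.read.format("csv")          # match "format" in tables.json
    .options(**candidate_options)     # candidate "format_options"
    .schema("id int, name string")    # stands in for "schema_hints"
    .load(
        "/Volumes/chi_cata/dev_first_last_filepushschema/"
        "chi_cata_filepushschema_filepush_volume/data/table1"
    )
)
df.show(10)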

Step 5. Fix the table configs in production

Redeploy the updated config and run a full refresh to correct existing data for an affected table:

databricks bundle deploy
databricks bundle run refresh_pipeline --full-refresh table1

That’s it! You now have a managed, push-based file ingestion workflow with debuggable table configs and repeatable deployments!
