This project implements a data pipeline using PySpark for data ingestion and processing. It provides functionality to read data from various sources including PostgreSQL databases and CSV files, process the data using PySpark, and store the results.
- Data ingestion from multiple sources:
- PostgreSQL database using JDBC
- PostgreSQL database using pandas
- Direct SQL queries to Spark
- Configurable logging system
- Modular pipeline architecture
- Support for both batch and streaming data processing
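The batch and streaming modes build on standard PySpark readers. The snippet below is only a generic illustration of that distinction, not the project's own streaming entry point; the input paths and schema are placeholders:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Batch: read a CSV file once into a DataFrame (placeholder path)
batch_df = spark.read.option("header", True).csv("/tmp/courses.csv")

# Streaming: continuously pick up new CSV files from a directory (placeholder path)
schema = StructType([StructField("course_name", StringType(), True)])
stream_df = spark.readStream.schema(schema).csv("/tmp/incoming/")
query = stream_df.writeStream.format("console").start()  # keeps running until stopped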
pipeline/
├── ingest.py # Data ingestion module
├── transform.py # Data transformation module
├── persist.py # Data persistence module
└── resources/
├── configs/
│ ├── logging.conf # Logging configuration
│ └── pipeline.ini # Pipeline configuration
└── postgresql-42.2.18.jar # PostgreSQL JDBC driver
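The modules are meant to be chained: ingest produces a Spark DataFrame, transform reshapes it, and persist writes the result. The sketch below assumes Transform and Persist classes analogous to the Ingest class shown later; their actual class and method names in this repository may differ:

from pyspark.sql import SparkSession
from pipeline.ingest import Ingest
from pipeline.transform import Transform  # class name assumed for illustration
from pipeline.persist import Persist      # class name assumed for illustration

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Ingest -> transform -> persist, passing a DataFrame between stages
df = Ingest(spark).ingest_data()
transformed_df = Transform(spark).transform_data(df)  # method name assumed
Persist(spark).persist_data(transformed_df)           # method name assumed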
- Python 3.8 or higher
- Apache Spark 3.5.0
- PostgreSQL 12 or higher
- Java 8 or higher (required for Spark)
- Clone the repository:
git clone https://github.com/imratnesh/pyspark-pipeline.git
cd pyspark-pipeline
- Create and activate a virtual environment (recommended):
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install the required dependencies:
pip install -r requirements.txt
- PostgreSQL Setup:
- Ensure PostgreSQL is running on localhost:5432
- Default credentials (modify as needed):
- Username: postgres
- Password: xxxx
- Database: postgres
- Create required schemas and tables:
- Use the provided SQL script to create schemas and tables:
psql -U postgres -d postgres -f CREATE_TABLES.sql
- This will create:
- futurexschema.futurex_course_catalog
- fxxcoursedb.fx_course_table
- Logging Configuration:
- Logging settings are in pipeline/resources/configs/logging.conf
- Adjust log levels and output paths as needed
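The modules can load this file with Python's standard logging machinery before creating their loggers. A minimal sketch, assuming logging.conf follows the fileConfig format:

import logging
import logging.config

# Load handler, formatter and level settings from the repository's config file
logging.config.fileConfig("pipeline/resources/configs/logging.conf")
logger = logging.getLogger(__name__)
logger.info("Logging configured from logging.conf")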
The pipeline provides several methods for data ingestion:
- Direct SQL query to Spark:
from pipeline.ingest import Ingest
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pipeline").getOrCreate()
ingest = Ingest(spark)
df = ingest.ingest_data()
- PostgreSQL ingestion using pandas:
ingest.read_from_pg()
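Internally, the pandas-based path typically reads the table over a plain PostgreSQL connection and then converts the result to a Spark DataFrame. A rough sketch of that approach (not necessarily the project's exact implementation; credentials and table name are the defaults from the setup section):

import pandas as pd
import psycopg2

# Plain PostgreSQL connection using the default credentials shown above
conn = psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                        user="postgres", password="xxxx")
pdf = pd.read_sql("SELECT * FROM futurexschema.futurex_course_catalog", conn)
conn.close()

# Convert the pandas DataFrame into a Spark DataFrame ('spark' is the session created above)
df = spark.createDataFrame(pdf)
df.show()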
- PostgreSQL ingestion using JDBC:
ingest.read_from_pg_using_jdbc_driver()
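Under the hood, JDBC ingestion uses Spark's built-in jdbc data source together with the bundled postgresql-42.2.18.jar driver. A sketch of the typical call (connection details are the defaults from the setup section; the exact options used by read_from_pg_using_jdbc_driver may differ):

# Reuses the SparkSession created above; the JDBC jar must be on Spark's classpath,
# e.g. spark-submit --jars pipeline/resources/postgresql-42.2.18.jar your_script.py
jdbc_df = (spark.read.format("jdbc")
           .option("url", "jdbc:postgresql://localhost:5432/postgres")
           .option("dbtable", "futurexschema.futurex_course_catalog")
           .option("user", "postgres")
           .option("password", "xxxx")
           .option("driver", "org.postgresql.Driver")
           .load())
jdbc_df.show()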
After installation and configuration, verify your setup:
- Check Python dependencies:
pip list | grep -E "pyspark|psycopg2|pandas"
- Check Spark and Java installation:
spark-submit --version
java -version
- Check PostgreSQL connection and tables:
psql -U postgres -d postgres -c "\dt futurexschema.*"
psql -U postgres -d postgres -c "\dt fxxcoursedb.*"
- Run a sample pipeline ingestion (from Python shell):
from pipeline.ingest import Ingest
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Pipeline").getOrCreate()
ingest = Ingest(spark)
df = ingest.ingest_data()
df.show()
If you see a DataFrame output, your setup is correct!
- Follow PEP 8 style guide for Python code
- Ensure proper logging is implemented for all operations
- Test database connections before running the pipeline
- Use type hints for better code maintainability
- Write unit tests for new features
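For the unit-testing guideline above, a local SparkSession is usually enough; no cluster or database is needed. A minimal pytest sketch (illustrative only, it does not exercise the real pipeline code):

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Lightweight local session shared across tests
    return SparkSession.builder.master("local[1]").appName("tests").getOrCreate()

def test_dataframe_roundtrip(spark):
    df = spark.createDataFrame([(1, "Hadoop"), (2, "Spark")], ["course_id", "course_name"])
    assert df.count() == 2
    assert "course_name" in df.columns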
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- LinkedIn: Ratnesh Kushwaha
- YouTube: India Analytica
For support, please open an issue in the GitHub repository or contact the maintainers.