IntroductionÂ
Modern data platforms must ingest and process data from a wide variety of sources ranging from SFTP servers and cloud object storage like S3 to structured files such as CSV. As organizations scale, managing these heterogeneous ingestion pipelines using hardcoded logic becomes brittle, difficult to maintain, and error-prone. A metadata-driven framework built with PySpark offers a scalable and flexible solution to this challenge.Â
This article presents a technical design for a metadata table-driven PySpark framework capable of handling multi-source data delivery at scale. It covers the problem, architectural solution, implementation approach, and concludes with benefits and extensibility.Â
Problem StatementÂ
Organizations dealing with large-scale data ingestion face several recurring challenges:Â
- Heterogeneous Data Sources –Â Data arrives from SFTP servers, S3 buckets, APIs, and local file systems. Each source has unique connection, authentication, and file format requirements.
- Hardcoded Pipelines –Â Traditional ETL jobs often embed source-specific logic. Adding a new source requires code changes, testing, and redeployment.
- Lack of Scalability –Â Pipelines are not reusable. Operational overhead increases linearly with the number of sources.
- Poor Observability –Â Limited tracking of ingestion status, errors, and retries.Difficult to debug failures across distributed systems.
- Schema Variability –Â Different file formats (CSV, JSON, Parquet) and evolving schemas complicate ingestion.Â
Solution OverviewÂ
The proposed solution is a metadata-driven ingestion framework using PySpark. Instead of writing separate pipelines, we define ingestion behavior using metadata tables. Key principles followed were:Â
- Configuration over codeÂ
- Reusable ingestion engineÂ
- Dynamic source handlingÂ
- Scalable and distributed processingÂ
- Extensible architectureÂ
Architecture Components
1. Metadata Tables
Central to the framework are metadata tables stored in a relational database or Hive metastore.Â
Example: Source Metadata TableÂ
| Column Name | Description |
| source_id | Unique identifier |
| source_type | SFTP / S3 / LOCAL |
| file_format | CSV / JSON / PARQUET |
| file_path | Source path |
| delimiter | For CSV |
| schema_location | Optional schema definition |
| target_table | Destination table |
| load_type | FULL / INCREMENTAL |
| active_flag | Enable/disable |
2. Ingestion Engine (PySpark) that dynamically executes logic based on metadata. The architecture includes:
- Source Connectors:Â For ingesting data from SFTP, S3 (native Spark), and Local/CSV files.Â
- Processing Layer:Â Responsible for schema enforcement, data validation, and transformation logic.Â
- Target Layer:Â Where processed data is stored, including a Data Lake (S3 / HDFS) and a Data Warehouse (Delta / Hive / Iceberg).Â
- Logging & Monitoring:Â Crucial for audit trails, error tracking, and managing retry mechanisms.Â
Implementation DetailsÂ
Step 1: Load MetadataÂ
Step 2: Dynamic Source HandlingÂ
Step 3: SFTP Integration (Simplified)Â
Step 4: Processing and WritingÂ
Step 5: Driver LogicÂ
Data Ingestion Best PracticesÂ
1. Schema Management (Evolution)
- Store the schema within metadata or a dedicated schema registry.Â
- Apply the schema dynamically using spark.read.schema().Â
2. Incremental Data Loading
- Utilize watermark columns, such as a timestamp field.Â
- Maintain the last successfully processed value in an audit table for tracking.Â
3. Execution Parallelization
- Avoid the use of .collect() to prevent pulling all data onto a single node.Â
- Leverage Spark’s inherent RDD parallelism.Â
- Employ task orchestration tools (e.g., Airflow) for managing parallel workflows.Â
4. File Discovery & Matching
- Support pattern matching with wildcards (e.g., *.csv).Â
- Implement automatic detection for newly added files.Â
5. Data Quality Checks (Validation)
- Perform null value verification.Â
- Ensure schema conformity (Schema validation).Â
- Set and monitor row count thresholds.Â
Benefits of Metadata-Driven DesignÂ
New connectors (API, Kafka, etc.) can be added easily.The system offers several key benefits. Â
- Scalability is ensured as adding a new source only requires inserting a new metadata record without any code changes. Â
- Maintainability is enhanced through centralized logic, which reduces duplication and simplifies updates. Â
- Flexibility allows the system to support multiple formats and sources without requiring a redesign. Â
- Observability is built-in with logging and audit tables that provide visibility into the pipeline’s health. Â
- Finally, Extensibility allows for the easy addition of new connectors, such as API or Kafka.Â
Challenges and ConsiderationsÂ
- Metadata Quality: Incorrect metadata leads to failures.Â
- Security: Credentials for SFTP/S3 must be managed securely (e.g., secrets manager).Â
- Performance: Avoid driver bottlenecks when processing many sources.Â
- Error Isolation: One failed source should not block others.Â
Architecture DiagramÂ
ConclusionÂ
A metadata table-driven PySpark framework provides a powerful and scalable approach to handling multi-source data ingestion. By decoupling ingestion logic from configuration, organizations can dramatically reduce development time, improve maintainability, and scale effortlessly as new data sources are introduced.Â
This design is especially effective in modern data lake architectures where flexibility, performance, and extensibility are critical. With additional enhancements like schema evolution, orchestration, and real-time ingestion, this framework can serve as the backbone of a robust enterprise data platform.Â







