Reliable file ingestion depends on validating the schema, constraints, and data quality rules. File ingestion has been around for about as long as most of us have, yet it is still worth showing the stages where data quality validation should be performed. There are three steps at which flat files, such as CSV, JSON, or XML, should be validated:

⚡ Verify the schema and the ability to read the file after copying it to your own raw file location. The file could be truncated (partially uploaded), or it may be missing required columns.

⚡ Run data quality checks that validate constraints, such as value uniqueness, nulls in required columns, and values that cannot be converted to their target data types. This lets you detect and reject files that cannot be loaded into a typed staging table.

⚡ Run additional data quality checks defined by data stewards and business users. If some of these checks fail but the issue is not of critical severity, load the data into the target table anyway.

This process requires one more important component: a job orchestrator that supports restarting a job at the step where it failed. If your data transformation code turns out to be wrong or the data quality checks are too restrictive, you can fix them and retry only the failed step. For the data quality checks, use a data quality platform that is callable from the data pipelines, such as DQOps. Check my profile to learn more.

#dataquality #dataengineering #datagovernance
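A minimal sketch of the three validation stages described in the post, using pandas. The file path, column names, and severity handling are illustrative assumptions, not part of the original post; any real pipeline would draw them from its own ingestion contract.

```python
# Sketch of the three-stage file validation, assuming a CSV file with an
# order_id / customer_id / amount layout (an invented example schema).
import pandas as pd

RAW_FILE = "raw/orders.csv"                               # assumed raw file location
REQUIRED_COLUMNS = {"order_id", "customer_id", "amount"}  # assumed required columns


def stage_1_schema_check(path: str) -> pd.DataFrame:
    """Verify the file is readable and has the required columns."""
    df = pd.read_csv(path)  # may fail outright on a truncated or corrupt upload
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"Schema check failed, missing columns: {missing}")
    return df


def stage_2_constraint_checks(df: pd.DataFrame) -> pd.DataFrame:
    """Reject files that cannot be loaded into a typed staging table."""
    if df["order_id"].duplicated().any():
        raise ValueError("Constraint check failed: duplicate order_id values")
    if df["customer_id"].isna().any():
        raise ValueError("Constraint check failed: nulls in required column customer_id")
    # Values not convertible to the target data type make the file unloadable.
    df["amount"] = pd.to_numeric(df["amount"], errors="raise")
    return df


def stage_3_business_checks(df: pd.DataFrame) -> list[str]:
    """Additional steward/business checks; non-critical failures only warn."""
    warnings = []
    if (df["amount"] <= 0).any():
        warnings.append("Non-critical: non-positive amounts found")
    return warnings


if __name__ == "__main__":
    df = stage_1_schema_check(RAW_FILE)
    df = stage_2_constraint_checks(df)
    for warning in stage_3_business_checks(df):
        print(warning)  # log the issue, but still load the data to the target table
    # load df into the target table here
```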
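And a hedged sketch of the orchestration point: each stage runs as its own task, so a failed run can be cleared and re-run from the failing task rather than from the beginning. Airflow (2.x TaskFlow API) is used here only as one example of such an orchestrator; the DAG id, schedule, retry counts, and staging path are assumptions.

```python
# Each validation stage is a separate task, so a failure can be retried or
# restarted at that step without repeating the earlier ones.
import pendulum
from airflow.decorators import dag, task


@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
def file_ingestion():

    @task(retries=2)
    def verify_schema() -> str:
        ...  # stage 1: copy the file to the raw location, verify it is readable
        return "staging/orders.parquet"  # assumed staging path

    @task(retries=2)
    def validate_constraints(staged_path: str) -> str:
        ...  # stage 2: uniqueness, nulls, type convertibility; raise to reject the file
        return staged_path

    @task
    def run_quality_checks(staged_path: str) -> None:
        ...  # stage 3: call a data quality platform (e.g. DQOps) from the pipeline

    run_quality_checks(validate_constraints(verify_schema()))


file_ingestion()
```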
Hey Piotr, great insights on file ingestion validation! Your breakdown of the three key steps is spot on. Have you considered leveraging AI-powered anomaly detection algorithms for enhanced data quality assurance? It could be a game-changer in streamlining the process.
Interesting
Founder @ DQOps open-source Data Quality platform | Detect any data quality issue and watch for new issues with Data Observability
The manual for testing CSV files from DQOps is here: https://dqops.com/docs/data-sources/csv/