This repository was archived by the owner on Mar 31, 2025. It is now read-only.

avrtt/gnomych

⚠️ Deprecation warning: this toolkit has been part of the Paysage library since version 1.2. The avrtt/gnomych repository is no longer supported.


gnomych automates data cleaning tasks, validates raw data against rule-based constraints, provides data profiling and reporting, and suggests automated corrections for common data issues. It is designed to work both as a standalone tool and as an integrated component of ETL pipelines.

This project is part of my freelance work and is published with the client's permission.

Features

  • Data cleaning

    • Removal of duplicates
    • Missing-value imputation (mean, median, mode, constant)
    • Column name standardization
    • Outlier detection using z‑score and IQR methods
  • Validation

    • JSON-schema based row validation
    • Custom business rule validations
  • Profiling & reporting

    • Missing values and summary statistics reports
    • Outlier profiling
    • Generation of comprehensive reports in Markdown and HTML
  • Automated correction suggestions

    • Imputation strategy recommendations
    • Outlier handling suggestions (clip/remove)
    • Automated application of corrections
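As an illustration of the two outlier detection methods named above, here is a minimal sketch using only the Python standard library. It is not gnomych's actual code: the function names `zscore_outliers` and `iqr_outliers`, the default thresholds, and the choice of the population standard deviation are all assumptions made for this example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds `threshold`.

    Hypothetical helper, not part of the gnomych API.
    """
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, v in enumerate(values) if abs((v - mean) / stdev) > threshold]


def iqr_outliers(values, k=1.5):
    """Return indices of values outside [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper, not part of the gnomych API.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


sample = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(iqr_outliers(sample))                   # [8]
print(zscore_outliers(sample, threshold=2.0)) # [8]
```

On small samples the z-score method is limited: with the population standard deviation, no value in a sample of nine can reach a z-score of 3, so the example lowers the threshold to 2.0. The IQR method does not have this limitation, which is one reason a tool might offer both.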

Installation

  1. Clone the repository:

    git clone https://github.com/avrtt/gnomych.git
    cd gnomych

  2. Create a virtual environment and activate it:

    python -m venv venv
    source venv/bin/activate # Windows: venv\Scripts\activate

  3. Install the package:

    pip install -e .

Usage

To run the command-line tool:

gnomych --input path/to/input.csv --report output_report.md

This reads the CSV file, cleans the data, generates a profiling report, and saves that report in Markdown format.
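The read, clean, profile, and report flow described above can be sketched roughly as follows. This is a standard-library approximation for illustration only, not the gnomych implementation: `standardize` and `clean_and_profile` are hypothetical names, and the report layout is an assumption.

```python
import csv
import io
import re


def standardize(name):
    """Lowercase a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def clean_and_profile(csv_text):
    """Deduplicate rows and count missing values per column.

    Returns (rows, report), where `report` is a Markdown string.
    Hypothetical helper, not part of the gnomych API.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = [standardize(h) for h in next(reader)]

    seen, rows = set(), []
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        key = tuple(raw)
        if key in seen:
            continue  # drop exact duplicate rows
        seen.add(key)
        rows.append(dict(zip(header, raw)))

    # A value counts as missing when it is absent or only whitespace.
    missing = {
        col: sum(1 for r in rows if r.get(col, "").strip() == "")
        for col in header
    }

    lines = [
        "# Data profile",
        "",
        f"Rows after deduplication: {len(rows)}",
        "",
        "| Column | Missing |",
        "| --- | --- |",
    ]
    lines += [f"| {col} | {count} |" for col, count in missing.items()]
    return rows, "\n".join(lines) + "\n"


rows, report = clean_and_profile("User ID,First Name\n1,Ann\n1,Ann\n2,\n")
print(report)
```

To save the result the way the `--report` option does, the returned string could be written out with `pathlib.Path("output_report.md").write_text(report)`.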

Running tests

To run the tests:

python -m unittest discover tests

Project structure

.
├── README.md
├── .gitignore
├── setup.py
├── requirements.txt
├── gnomych/
│   ├── __init__.py
│   ├── __main__.py
│   ├── cleaning.py
│   ├── validation.py
│   ├── profiling.py
│   ├── reporting.py
│   ├── correction.py
│   ├── exceptions.py
│   └── utils.py
└── tests/
    ├── test_cleaning.py
    ├── test_validation.py
    ├── test_profiling.py
    ├── test_reporting.py
    └── test_correction.py

License

MIT.
