Skip to content

RGGH/exa-data

Repository files navigation

Example ETL project using FHIR standard data

FHIR is a standard for health care data exchange, published by HL7®

Purpose of code : Transfer data to structured format (SQL) from 80 sample JSON files, with circa 50,000 lines in each file.

User guide :

TLDR;

  • Clone this repo : git clone https://github.com/RGGH/exa-data.git
  • Install Docker to create the database and initialize 'patient_info' table with required fields
  • Run python code (2 files below)

python3 exa_flatten_json_csv.py

python3 exa_parse_for_sql.py

Full Steps:

1. Install Docker Compose

https://docs.docker.com/compose/install/

docker-compose --version
*If you get an error :
(env) rag@rag-Latitude-5490:~/env/exa-data-1/exa-data$ sudo docker-compose up
Creating network "exa-data_default" with the default driver
ERROR: could not find an available, non-overlapping IPv4 address pool among the defaults to assign to the network

Stop your openvpn :
sudo service openvpn stop

2. Run Docker Comopse

sudo docker-compose up

3. Install Python Packages

pip3 install -r requirements.txt

4. Run the 2 Python files

  • These have been kept separate to allow for batch conversion from JSON to CSV, and then import into SQL

python3 exa_flatten_json_csv.py
python3 exa_parse_for_sql.py

5. Check the imported data in in PostgreSQL

python3 db_connect.py

About this project

  • used Docker compose with yaml file to create container for a PostgreSQL 14 image
  • used regex to extract nested keys from 50,000 lines of JSON
  • used Vim to remove empty rows
  • used Pandas to remove duplicates but preserve sort order
  • used Vim to add data types in bulk, plus commas
  • used Jupyter Notebook for exploratory data anlaysis and testing
  • unable to find data dictionary on FHIR site for this actual project http://hl7.org/fhir/overview-dev.html http://hl7.org/fhir/resourcelist.html
  • not all fields populated due to time constraint, have done first few in order
  • had issues with lists inside dicts. Resolved this with 'typing' module https://docs.python.org/3/library/typing.html#module-typing
ToDo
  • Add extra tests
  • Fix issues flagged in pylint with github workflow
  • Consider more than 1 table for data, eg patients table, plus a join with notes each fullUrl
  • Create frontend in HTML/JavaScript and consume data from an API (yet to be written, in Python with FastAPI)
  • add remaining fieids in exa_parse_for_sql.py

screenshot

About

ETL : FHIR sample data JSON to SQL

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published
pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy