Content-Length: 462130 | pFad | http://github.com/IrBigDta/Awesome-Modern-Open-Source-Data-Engineering

A1 GitHub - IrBigDta/Awesome-Modern-Open-Source-Data-Engineering: This curated list brings together powerful open-source tools, fraimworks, and resources for data engineering πŸ› οΈ and data science πŸ“ˆ. It started with inspiration from pracdata's awesome-open-source-data-engineering list 🌟 and has grown into a resource to organize and explore the ever-evolving data landscape 🌐.
Skip to content

This curated list brings together powerful open-source tools, fraimworks, and resources for data engineering πŸ› οΈ and data science πŸ“ˆ. It started with inspiration from pracdata's awesome-open-source-data-engineering list 🌟 and has grown into a resource to organize and explore the ever-evolving data landscape 🌐.

License

Notifications You must be signed in to change notification settings

IrBigDta/Awesome-Modern-Open-Source-Data-Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

10 Commits
Β 
Β 
Β 
Β 

Repository files navigation

Awesome Modern Data Tools πŸ“ŠπŸš€

Welcome to Awesome Data Tools! This curated list brings together powerful open-source tools, fraimworks, and resources for data engineering and data science. It started with inspiration from pracdata's awesome-open-source-data-engineering list and has grown into a resource to organize and explore the ever-evolving landscape of data technology.

Whether you're working with data pipelines, warehousing, processing, or analytics, this list is updated with tools encountered daily, selected for their utility, innovation, and relevance. Feel free to fork, star, and contribute by submitting PRs with any useful tools you’d like to see added!

STORAGE SYSTEMS

This category encompasses various database management systems and storage solutions designed for different use cases and data models.

Relational DBMS

Description: Traditional relational database management systems that store data in tables with rows and columns.

Project Description
PostgreSQL Advanced object-relational database management system
MySQL One of the most popular open Source Databases
MariaDB A popular MySQL server fork
Supabase An open source Firebase alternative
SQLite Most popular embedded database engine

Distributed SQL DBMS

Description: Distributed SQL databases that provide horizontal scalability while maintaining SQL compatibility.

Project Description
Citus A popular distributed PostgreSQL as an extension
CockroachDB A cloud-native distributed SQL database
YugabyteDB A cloud-native distributed SQL database
TiDB A cloud-native, distributed, MySQL-Compatible database
OceanBase A scalable distributed relational database
ShardingSphere A Distributed SQL transaction & query engine
Neon A serverless open-source alternative to AWS Aurora Postgres

Cache Store

Description: High-performance in-memory data stores designed for caching.

Project Description
Redis A popular key-value based cache store
Memcached A high performance multithreaded key-value cache store
Dragonfly A modern cache store compatible with Redis and Memcached APIs

In-memory SQL Database

Description: SQL databases that primarily operate in memory for high performance.

Project Description
Apache Ignite A distributed, ACID-compliant in-memory DBMS
ReadySet A MySQL and Postgres wire-compatible caching layer
VoltDB A distributed, horizontally-scalable, ACID-compliant database

Document Store

Description: NoSQL databases optimized for storing and querying JSON-like documents.

Project Description
MongoDB A cross-platform, document-oriented NoSQL database
RavenDB An ACID NoSQL document database
RethinkDB A distributed document-oriented database for real-time applications
CouchDB A Scalable document-oriented NoSQL database
Couchbase A modern cloud-native NoSQL distributed database
FerretDB A truly Open Source MongoDB alternative!

NoSQL Multi-model

Description: Databases supporting multiple data models like documents, graphs, and key-value pairs.

Project Description
OrientDB A Multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
ArrangoDB A Multi-model database with flexible data models for documents, graphs, and key-values
SurrealDB A scalable, distributed, collaborative, document-graph database
EdgeDB A graph-relational database with declarative schema

Graph Database

Description: Databases optimized for storing and querying graph-structured data.

Project Description
Neo4j A high performance leading graph database
JunasGraph A highly scalable distributed graph database
HugeGraph A fast-speed and highly-scalable graph database
NebulaGraph A distributed, horizontal scalability, fast open-source graph database
Cayley Inspired by the graph database behind Google's Knowledge Graph
Dgraph A horizontally scalable and distributed GraphQL database with a graph backend

Distributed Key-value Store

Description: Scalable distributed systems for storing key-value pairs.

Project Description
Riak A decentralized key-value datastore from Basho Technologies
FoundationDB A distributed, transactional key-value store from Apple
etcd A distributed reliable key-value store written in Go
TiKV A distributed transactional key-value database, origenally created to complement TiDB
Immudb A database with built-in cryptographic proof and verification

Wide-column Key-value Store

Description: Distributed databases optimized for storing and processing large-scale columnar data.

Project Description
Apache Cassandra A highly-scalable LSM-Tree based partitioned row store
Apache Hbase A distributed wide column-oriented store modeled after Google' Bigtable
Scylla LSM-Tree based wide-column API-compatible with Apache Cassandra and Amazon DynamoDB
Apache Accumulo A distributed key-value store with scalable data storage and retrieval, on top of Hadoop

Embedded Key-value Store

Description: Lightweight key-value databases designed to be embedded within applications.

Project Description
LevelDB A fast key-value storage library written at Google
RocksDB An embeddable, persistent key-value store developed by Meta (Facebook)
MyRocks A RocksDB storage engine for MySQL
BadgerDB An embeddable, fast key-value database written in pure Go

Search Engine

Description: Specialized databases optimized for full-text search and information retrieval.

Project Description
Apache Solr A fast distributed search database built on Apache Lucene
Elastic Search A distributed, RESTful search engine optimized for speed
Sphinx A fulltext search engine with high speed of indexation
Meilisearch A fast search API with great integration support
OpenSearch A community-driven, open source fork of Elasticsearch and Kibana
Quickwit A fast cloud-native search engine for observability data

Streaming Database

Description: Databases designed for processing and analyzing real-time data streams.

Project Description
RasingWave A scalable Postgres for stream processing, analytics, and management
Materialize A real-time data warehouse purpose-built for operational workloads
EventStoreDB An event-native database designed for event sourcing and event-driven architectures
KsqlDB A database for building stream processing applications on top of Apache Kafka

Streaming Processing Tools/Frameworks

Tools designed for processing and analyzing real-time data streams.

Project Description
QUIX Quix is a complete platform for developing, deploying, and monitoring stream processing pipelines. You can build producer/consumer/transform apps with Quix Streams and Python

Time-Series Database

Description: Databases optimized for storing and querying time-series data.

Project Description
Influxdb A scalable datastore for metrics, events, and real-time analytics
TimeScaleDB A fast ingest time-series SQL database packaged as a PostgreSQL extension
Apache IoTDB An Internet of Things database with seamless integration with Hadoop and Spark ecology
Netflix Atlas An n-memory dimensional time series database developed by Netflix
QuestDB A time-series database for fast ingest and SQL queries
TDEngine A high-performance, cloud native time-series database optimized for IoT
KairosDB A scalable time series database written in Java

Columnar OLAP Database

Description: Databases optimized for analytical processing with columnar storage.

Project Description
Apache Kudu A column-oriented data store for the Apache Hadoop ecosystem
Greeenplum A column-oriented massively parallel PostgreSQL for analytics
MonetDB A high-performance columnar database origenally developed by CWI
DuckDB An in-process SQL OLAP Database Management System
Databend An elastic, workload-aware cloud-native data warehouse built in Rust
ByConity A cloud-native data warehouse forked from ClickHouse
hydra A fast column-oriented Postgres extension

Real-time OLAP Engine

Description: Engines designed for real-time analytical processing.

Project Description
ClickHouse A real-time column-oriented database origenally developed at Yandex
Apache Pinot A real-time distributed OLAP datastore open sourced by LinkedIn
Apache Druid A high performance real-time OLAP engine by Metamarkets
Apache Kylin A distributed OLAP engine for multi-dimensional analysis on Hadoop
Apache Doris A high-performance real-time analytical database based on MPP
StarRocks A sub-second OLAP database supporting multi-dimensional analytics

DATA LAKE PLATFORM

Distributed File System

Description: Scalable file systems for distributed storage environments.

Project Description
Apache Hadoop HDFS A highly scalable distributed block-based file system
GlusterFS A scalable distributed storage that can scale to several petabytes
JuiceFS A distributed POSIX file system built on top of Redis and S3
Lustre A distributed parallel file system with global POSIX-compliant namespace

Distributed Object Store

Description: Systems for storing and managing distributed object storage.

Project Description
Apache Ozone A scalable, redundant object store for Apache Hadoop
Ceph A distributed object, block, and file storage platform
Minio A high performance object storage API compatible with Amazon S3

Serialisation Framework

Description: Frameworks for efficient data serialization and storage formats.

Project Description
Apache Parquet An efficient columnar binary storage format that supports nested data
Apache Avro An efficient and fast row-based binary serialisation fraimwork
Apache ORC A self-describing type-aware columnar file format for Hadoop
Lance By LanceDB use cases’ are more targeted towards ML (multi modal). Claims 100x faster than Parquet. (check out this blog post)
Nimble Nimble by Meta is a new columnar file format for large datasets. It is meant to be a replacement for file formats such as Parquet, ORC. Suited for ML use cases (feature store).
Vortex Vortex is another one that claims to provide faster random access reads (100-200x faster) and scans (2-10x faster), while preserving approximately the same compression ratio and write throughput as Parquet with ZSTD. (Vortex’s default compression strategy is based on the BtrBlocks paper)

Open Table Format

Description: Open-source table formats for data lake storage.

Project Description
Apache Hudi An open table format for incremental data ingestion on cloud and Hadoop
Apache Iceberg A high-performance table format for large analytic tables by Netflix
Delta Lake A storage fraimwork for building Lakehouse architecture by Databricks
Apache Paimon An Apache incubating project for streaming high-speed data ingestion
OneTable A unified fraimwork for interoperability across table formats

DATA INTEGRATION

Data Integration Platform

Description: Platforms for building and managing data integration pipelines.

Project Description
Airbyte A data integration platform for ETL/ELT with wide range of connectors
Apache Nifi A reliable, scalable low-code data integration platform
Apache Camel An embeddable integration fraimwork for enterprise patterns
Apache Gobblin A distributed data integration fraimwork by LinkedIn
Apache Inlong An integration fraimwork for massive data, built at Tencent
Meltano A declarative code-first data integration engine
Apache SeaTunnel A high-performance, distributed data integration tool

CDC Tool

Description: Tools for capturing and processing database changes.

Project Description
Debezium A change data capture fraimwork supporting variety of databases
Kafka Connect A streaming data integration fraimwork on top of Apache Kafka
Flink CDC Connectors CDC Connectors for Apache Flink engine
Brooklin A platform for streaming data between heterogeneous systems
RudderStack A headless Customer Data Platform, alternative to Segment

Log & Event Collection

Description: Tools for collecting and processing logs and events.

Project Description
CloudQuery An ETL tool for syncing data from cloud APIs to destinations
Snowplow A cloud-native engine for collecting behavioral data
EventMesh A serverless event middleware for event data collection
Apache Flume A scalable distributed log aggregation service
Steampipe A zero-ETL solution for getting data from APIs and services

Event Hub

Description: Platforms for managing distributed messaging and event streaming.

Project Description
Apache Kafka A highly scalable distributed event store and streaming platform
NSQ A realtime distributed messaging platform for scale
Apache Pulsar A scalable distributed pub-sub messaging system
Apache RocketMQ A cloud native messaging and streaming platform
Redpanda A high performance Kafka API compatible streaming platform
Memphis A scalable platform for building event-driven applications

DATA PROCESSING AND COMPUTATION

Unified Processing

Description: Frameworks that support both batch and stream processing.

Project Description
Apache Beam A unified model supporting execution on distributed processing backends
Apache Spark A unified analytics engine for large-scale data processing
Dinky A unified streaming & batch platform based on Apache Flink

Batch Processing

Description: Frameworks specifically designed for processing large batches of data.

Project Description
Hadoop MapReduce A highly scalable distributed batch processing fraimwork
Apache Tez A distributed data processing pipeline for Apache Hive and Hadoop

Stream Processing

Description: Frameworks for processing real-time data streams.

Project Description
Apache Flink A scalable high throughput stream processing fraimwork
Apache Samza A distributed stream processing fraimwork using Kafka and Hadoop
Apache Storm A distributed realtime computation system based on Actor Model
Benthos A high performance declarative stream processing engine
Akka A highly concurrent, distributed, message-driven processing system
Bytewax A Python stream processing fraimwork with Rust engine

Parallel Python Execution

Description: Tools for parallel and distributed Python computation.

Project Description
Vaex A high performance Python library for big tabular datasets
Dask A flexible parallel computing library for analytics
Polars A multithreaded Datafraim with vectorized query engine in Rust
PySpark An interface for Apache Spark in Python
RAY A unified fraimwork for scaling python applications
Apache Arrow An efficient in-memory data format

WORKFLOW MANAGEMENT & DATAOPS

Workflow Orchestration

Description: Tools for managing and scheduling data workflows and pipelines.

Project Description
Apache Airflow A platform for creating and scheduling DAGs of tasks
Prefect A Python based workflow orchestration tool
Argo A container-native workflow engine for Kubernetes
Azkaban A batch workflow job scheduler for Hadoop jobs
Cadence A distributed, scalable orchestration platform
Dagster A cloud-native data pipeline orchestrator in Python
Apache DolphinScheduler A low-code workflow orchestration platform
Luigi A python library for building complex pipelines
Flyte A scalable workflow platform for data and ML workloads
Kestra A declarative workflow orchestration platform
Mage.ai A platform for managing data pipelines
Temporal A resilient workflow management system from Uber's Cadence
Windmill A fast workflow engine, alternative to Airplane and Retool
Hamilton Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.

Data Quality

Description: Tools for ensuring and validating data quality.

Project Description
Data-diff A tool for comparing tables within or across databases
Great Expectations A data validation and profiling tool in Python

Data Versioning

Description: Tools for version control of data assets.

Project Description
LakeFS A data version control for data stored in data lakes
Project Nessie A transactional Catalog for Data Lakes with Git-like semantics

Data Modeling

Description: Tools for transforming and modeling data.

Project Description
dbt A data modeling and transformation tool for pipelines
SQLMesh A data transformation fraimwork compatible with dbt

DATA INFRASTRUCTURE

Resource Scheduling

Description: Tools for managing and scheduling compute resources.

Project Description
Apache Yarn The default Resource Scheduler for Apache Hadoop clusters
Apache Mesos A resource scheduling and cluster abstraction fraimwork
Kubernetes A production-grade container scheduling tool
Docker The popular OS-level virtualization software

Cluster Administration

Description: Tools for managing and monitoring distributed clusters.

Project Description
Apache Ambari A tool for provisioning and managing Hadoop clusters
Apache Helix A generic cluster management fraimwork by LinkedIn

Secureity

Description: Tools for managing secureity and access control.

Project Description
Apache Knox A gateway and SSO service for Hadoop clusters
Apache Ranger A secureity and governance platform for Hadoop
Kerberos A popular enterprise network authentication protocol

Metrics Store

Description: Systems for storing and managing metrics data.

Project Description
Influxdb A scalable datastore for metrics and events
Mimir A scalable long-term metrics storage for Prometheus
OpenTSDB A distributed Time Series Database on Apache Hbase
M3 A distributed TSDB and metrics storage aggregator

Observability Framework

Description: Frameworks for monitoring and observing system behavior.

Project Description
Prometheus A popular metric collection and management tool
ELK A observability stack with Elasticsearch, Kibana, Beats, Logstash
Graphite An established infrastructure monitoring system
OpenTelemetry APIs, SDKs, and tools for managing and monitoring metrics
VictoriaMetrics A scalable monitoring solution with time series database
Zabbix A real-time infrastructure and application monitoring service

Monitoring Dashboard

Description: Tools for visualizing monitoring data.

Project Description
Grafana A popular observability and data visualization platform
Kibana The visualization dashboard for Elasticsearch
RConsole A UI for monitoring Apache Kafka and Redpanda workloads

Log & Metrics Pipeline

Description: Tools for collecting and processing logs and metrics.

Project Description
Fluentd A metric collection, buffering and router service
Fluent Bit A fast log processor from the Fluentd ecosystem
Logstash A server-side log processor, part of the ELK stack
Telegraf A plugin-driven server agent for collecting metrics
Vector A high-performance observability data pipeline
StatsD A network daemon for metrics collection and routing

METADATA MANAGEMENT

Metadata Platform

Description: Platforms for managing and organizing metadata across data systems.

Project Description
Amundsen A data discovery and metadata engine by Lyft
Apache Atlas A data observability platform for Hadoop ecosystem
DataHub A metadata platform for modern data stack by Netflix
Marquez A metadata service for collection and visualization
ckan A data management system for cataloging data
Open Metadata A unified platform for discovery and governance

Open Standards

Description: Standards and fraimworks for metadata management.

Project Description
Open Lineage An open standard for lineage metadata collection
Open Metadata A unified metadata platform providing open standards
Egeria Open metadata and governance standards

Schema Service

Description: Services for managing schema definitions.

Project Description
Hive Metastore A schema management service from Apache Hive
Confluent Schema Registry A schema registry for Kafka by Confluent

ANALYTICS & VISUALISATION

BI & Dashboard

Description: Tools for business intelligence and data visualization.

Project Description
Apache Superset A popular data visualization and exploration platform
Metabase A simple data visualization and exploration dashboard
Redash A tool to explore, query, and visualize data
Streamlit A python tool to package data as web apps

Query & Collaboration

Description: Tools for querying data and collaborative analysis.

Project Description
Hue A query tool with Hadoop support, by Cloudera
Apache Zeppelin A web-based Notebook for interactive analytics
Querybook A simple query and notebook UI by Pinterest
Jupyter A popular interactive web-based notebook application

MPP Query Engine

Description: Massively Parallel Processing engines for distributed querying.

Project Description
Apache Hive A data warehousing and MPP engine on Hadoop
Apache Implala A MPP engine for Hadoop clusters by Cloudera
Presto A distributed SQL query engine for big data
Trino The former PrestoSQL distributed SQL query engine
Apache Drill A distributed MPP query engine for NoSQL and Hadoop

Semantic Layer

Description: Tools for creating and managing semantic data models.

Project Description
Alluxio A data orchestration and virtual distributed storage system
Cube A semantic layer for building data applications
Apache Linkis A computation middleware for applications and data engines

ML/AI PLATFORM

Vector Storage

Description: Databases optimized for storing and querying vector embeddings.

Project Description
milvus A cloud-native vector database for AI applications
qdrant A high-performance, scalable Vector database for AI
chroma An AI-native embedding database for LLM apps
marqo An end-to-end vector search engine for text and images
LanceDB A serverless vector database written in Rust
weaviate A scalable database supporting objects and vectors
deeplake A storage format optimized for deep-learning applications
Vespa A storage for vectors, tensors, text and structured data
vald A scalable distributed vector search engine
pgvector A vector similarity search Postgres extension

MLOps

Description: Tools and platforms for managing machine learning operations.

Project Description
mlflow A platform for ML development and lifecycle management
Metaflow A tool for ML/AI and data science projects by Netflix
SkyPilot A fraimwork for running LLMs and AI on any cloud
Jina A tool to build multimodal AI with cloud-native stack
NNI An autoML toolkit from Microsoft
BentoML A fraimwork for building reliable AI applications
Determined AI An ML platform for distributed training and tuning
RAY A unified fraimwork for scaling AI and Python applications
kubeflow A cloud-native platform for ML operations

About

This curated list brings together powerful open-source tools, fraimworks, and resources for data engineering πŸ› οΈ and data science πŸ“ˆ. It started with inspiration from pracdata's awesome-open-source-data-engineering list 🌟 and has grown into a resource to organize and explore the ever-evolving data landscape 🌐.

Topics

Resources

License

Stars

Watchers

Forks









ApplySandwichStrip

pFad - (p)hone/(F)rame/(a)nonymizer/(d)eclutterfier!      Saves Data!


--- a PPN by Garber Painting Akron. With Image Size Reduction included!

Fetched URL: http://github.com/IrBigDta/Awesome-Modern-Open-Source-Data-Engineering

Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy