This repository hosts the Debias project, which aims to show relationships between different concepts in the news.
We cover different geographical locations (mainly the USA and the UK), different political positions (taken from AllSides), and various news providers.
The final goal is an interactive visualization that shows how concepts are interconnected across different timestamps and from different points of view.
Scraper is a service which scrapes news from different news providers. The service calls itself recursively to scrape subsequent news pages. If a page requires rendering, it is sent to the renderer service. If a page is static, it is stored in the s3 service, its metadata is stored in the metastore service, and the processor service is called to process the page.
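As a rough illustration of that routing, the sketch below fetches a page and dispatches it down one of the two paths. The helper objects (`renderer`, `s3`, `metastore`, `processor`) and the length-based rendering heuristic are hypothetical stand-ins, not the project's actual API.

```python
# Hypothetical sketch of the scraper's routing step; helper names and the
# rendering heuristic are illustrative only.
import httpx


def handle_page(url: str, renderer, s3, metastore, processor) -> None:
    """Fetch one news page and route it as described above."""
    html = httpx.get(url, follow_redirects=True).text

    # Stand-in check for "page requires rendering": a nearly empty body
    # usually means content is injected by JavaScript at load time.
    if len(html) < 1024:
        renderer.render(url)       # hand the page off to the renderer service
    else:
        key = s3.store(url, html)  # static page goes straight to the s3 service
        metastore.save(url, key)   # record its metadata in the metastore
        processor.process(key)     # ask the processor service to handle it
```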
Renderer is a service which renders news pages using a browser API. It is called by the scraper service. After rendering, it saves the HTML content to the s3 service, stores metadata in the metastore service, and sends a request to the processor service to process the page.
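The core rendering step might look roughly like the sketch below, which uses Playwright as one possible browser API; the actual renderer may rely on a different tool, and the storage and messaging calls are omitted here.

```python
# Illustrative rendering step using Playwright (one possible browser API);
# this is not necessarily the tool the renderer service actually uses.
from playwright.sync_api import sync_playwright


def render_page(url: str) -> str:
    """Load a JavaScript-heavy page in a headless browser and return its HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for dynamic content
        html = page.content()
        browser.close()
    return html
```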
Processor is a service which processes news pages. It extracts human-readable text from the page, runs NLP pipelines, and stores the results in the wordstore service.
- Classifier
  - A zero-shot classifier from HuggingFace Transformers, specifically `MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli`, chosen for its comparatively small size.
- Extractor
  - A keyword extraction algorithm built on spaCy: named entities are extracted and, after processing, used as keywords. A minimal sketch of both steps follows this list.
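The candidate labels, the sample text, and the `en_core_web_sm` spaCy model below are assumptions for illustration; only the DeBERTa checkpoint name comes from the project description.

```python
# Sketch of the processor's two NLP steps: zero-shot topic classification
# and named-entity keyword extraction. Labels, sample text, and the spaCy
# model name are illustrative assumptions.
import spacy
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",
)
nlp = spacy.load("en_core_web_sm")

text = "The government announced new tariffs on imported steel."

# Zero-shot classification against a hypothetical label set.
result = classifier(text, candidate_labels=["economy", "politics", "sports"])
print(result["labels"][0], result["scores"][0])

# Named entities become keyword candidates after further processing.
keywords = [ent.text for ent in nlp(text).ents]
print(keywords)
```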
Web server which serves the results of the processor service. It aggregates word statistics, precomputes and caches the aggregations, and serves them to the client. It also serves the frontend files.
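Since the backend is launched with Litestar (see the development instructions below), the idea of precomputing and caching aggregations can be sketched as follows; the route path, the in-memory cache, and `fetch_keyword_counts()` are hypothetical.

```python
# Minimal Litestar sketch of "aggregate once, cache, serve"; the route,
# cache, and fetch_keyword_counts() are hypothetical, not the real API.
from litestar import Litestar, get

_cache: dict[str, dict[str, int]] = {}


def fetch_keyword_counts() -> dict[str, int]:
    """Placeholder for an aggregation query against the wordstore database."""
    return {"election": 120, "economy": 87}


@get("/api/keywords")
async def keywords() -> dict[str, int]:
    # Compute the aggregation on first request, then reuse the cached value.
    if "keywords" not in _cache:
        _cache["keywords"] = fetch_keyword_counts()
    return _cache["keywords"]


app = Litestar(route_handlers=[keywords])
```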
- metastore: a Postgres database which stores metadata of the scraped pages.
- s3: an S3 provider which stores the static pages; it could be a local MinIO deployment or an external S3 cloud service.
- wordstore: a Postgres database which stores the processed pages, keywords, topics, and their corresponding frequencies.
- NATS: a message queue used for service-to-service (S2S) communication (see the sketch below).
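As a rough example of that S2S messaging with the `nats-py` client, the sketch below publishes and consumes a single message; the subject name and payload are assumptions.

```python
# Sketch of service-to-service messaging over NATS using nats-py; the
# subject "debias.process" and the payload are illustrative assumptions.
import asyncio

import nats


async def main() -> None:
    nc = await nats.connect("nats://localhost:4222")

    # A processor-like consumer subscribes to the subject...
    sub = await nc.subscribe("debias.process")

    # ...and a scraper/renderer-like producer publishes the key of a stored page.
    await nc.publish("debias.process", b"s3-key-of-rendered-page")

    msg = await sub.next_msg(timeout=1)
    print("received:", msg.data.decode())
    await nc.close()


asyncio.run(main())
```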
The JavaScript visualization is available at https://debias.dartt0n.ru/
- Create a `.env` file and fill in the following variables:
  PG_USERNAME=...
  PG_PASSWORD=...
- Create configuration files:
  - `debias/scraper/config.toml`
  - `debias/server/config.toml`
  - `debias/processor/config.toml`
  - `debias/renderer/config.toml`
  Note: you can find example configuration in the following files:
- Pre-download the ML models:
  mkdir models
  uv run --group processor download-models.py
- Run the services:
  docker compose -f docker-compose.yml up --build --detach
- Create a `.env` file and fill in the following variables:
  MINIO_ACCESS_KEY=...
  MINIO_SECRET_KEY=...
  MINIO_BUCKET=...
  PG_USERNAME=...
  PG_PASSWORD=...
- Create configuration files:
  - `debias/scraper/config.toml`
  - `debias/server/config.toml`
  - `debias/processor/config.toml`
  - `debias/renderer/config.toml`
  Note: you can find example configuration in the following files:
- Pre-download the ML models:
  mkdir models
  uv run --group processor download-models.py
- Create the MinIO S3 service using Docker:
  docker compose -f minio.docker-compose.yml up minio_setup
The following services can be scaled horizontally for better performance:
- scraper
- renderer
- processor
For easy scaling, use the docker compose `--scale` option.
For example, the following command launches 5 `scraper` instances, 2 `renderer` instances, and 2 `processor` instances:
docker compose up --detach \
  --scale scraper=5 \
  --scale renderer=2 \
  --scale processor=2
To stop and remove all containers AND THEIR VOLUMES:
docker compose -f minio.docker-compose.yml down --volumes
# or
docker compose -f docker-compose.yml down --volumes
.
├── debias            # shared code root
│   ├── core          # reusable components - s3, metastore, configs, etc.
│   ├── scraper       # scraper-related code
│   ├── processor     # NLP processor-related code
│   ├── renderer      # browser renderer-related code
│   ├── server        # server-related code
│   └── frontend      # frontend-related code
To add a new service:
- Create a new directory in the `debias` directory
- Create a dockerfile prefixed with the service name (e.g. `scraper.dockerfile`)
- Add all the required dependencies to `pyproject.toml` under `--group servicename`
- Add the new package to the `tool.hatch.build.targets.wheel` config in `pyproject.toml`
- Create a `.env` file and fill in the following variables:
  PG_USERNAME=...
  PG_PASSWORD=...
- Launch the database container using docker compose:
  docker compose up -d database
- Generate random data. Set the environment variable `POSTGRES_CONNECTION` to the connection string of the database (replace `USERNAME` and `PASSWORD` with your actual username and password):
  POSTGRES_CONNECTION="postgresql://USERNAME:PASSWORD@localhost:5432/postgres" uv run generate-data.py
- Create the server configuration file `config.toml`. Replace `${PG_USERNAME}` and `${PG_PASSWORD}` with your actual username and password:
  [pg]
  connection = "postgresql://${PG_USERNAME}:${PG_PASSWORD}@localhost:5432/postgres"
- Launch the backend server with hot reload:
  CONFIG=config.toml uv run litestar --app debias.server:app run --debug --reload
We have collected 38 news sources from the USA and the UK and identified their political positions.
It seems that left-leaning parties are indeed more liberal.
We have parsed several news articles using Python and prepared a deployment describing general trends in these articles.
The deployment can be found on GitHub Pages.
The visualization is divided into 3 parts:
- Comparison of topic distributions for Left-Leaning and Right-Leaning media.
- Comparison of keyword networks for Left-Leaning and Right-Leaning media.
- Sandbox network with filtering functionality.
All visualizations are created using D3.js.
You can view the visualization at https://debias.dartt0n.ru/