
De Bias


Overview

This repository hosts the DeBias project, which is dedicated to showing relationships between different concepts in the news.

We cover different geographical locations (mainly USA and UK), different political positions (taken from AllSides) and various news providers.

The final goal is to create an interactive visualization showing how concepts are interconnected across different time spans and from different points of view.

Overview diagram

Technologies

  • Python
  • Docker
  • Redis
  • MinIO
  • NATS
  • Postgres
  • Playwright
  • Litestar
  • Polars
  • D3.js

NLP:

  • Transformers
  • SpaCy

Services

Architecture

Scraper is a service which scrapes news from different news providers. It recursively calls itself to scrape subsequent news pages. If a page requires rendering, it is sent to the renderer service. If a page is static, it is stored in the s3 service, its metadata is stored in the metastore service, and the processor service is called to process the page.

Renderer is a service which renders news pages using a browser API. It is called by the scraper service. After rendering, it saves the HTML content to the s3 service and the metadata to the metastore service, then sends a request to the processor service to process the page.

Processor is a service which processes news pages. It extracts human-readable text from the page, runs the NLP pipeline, and stores the results in the wordstore service.
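
The hand-off between these services can be illustrated with a small sketch of the kind of message the scraper or renderer might publish for the processor once a page is stored. The field names here are assumptions for illustration, not the project's actual schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical payload a scraper/renderer instance could publish on the
# message queue once a page's HTML is stored in S3 and its metadata in
# the metastore.  Field names are illustrative, not the real schema.
@dataclass
class PageProcessingRequest:
    s3_key: str        # where the raw HTML was stored
    metadata_id: int   # row id in the metastore (Postgres)
    url: str           # original page URL

    def to_message(self) -> bytes:
        """Serialize for publishing on a message-queue subject."""
        return json.dumps(asdict(self)).encode("utf-8")

    @classmethod
    def from_message(cls, raw: bytes) -> "PageProcessingRequest":
        """Deserialize a received message back into a request."""
        return cls(**json.loads(raw.decode("utf-8")))

req = PageProcessingRequest(s3_key="pages/abc123.html", metadata_id=42,
                            url="https://example.com/news/1")
roundtrip = PageProcessingRequest.from_message(req.to_message())
```

Serializing to JSON keeps the payload language-agnostic, which matters when several horizontally scaled consumers read the same subject.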

NLP pipeline

  • Classifier
    • A zero-shot classifier from HuggingFace Transformers. In particular, MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli, due to its comparatively small size.
  • Extractor
    • A keyword extraction algorithm built on SpaCy. SpaCy is used to extract named entities, which are used as keywords after processing.
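
As a rough sketch of the extractor step, assuming the named entities have already been produced by SpaCy (`doc.ents` yields spans with `.text` and `.label_`), the post-processing into keyword frequencies might look like this. The entity tuples and the set of kept labels are stand-ins for illustration:

```python
from collections import Counter

# (text, label) tuples stand in for spaCy entity spans; in the real
# pipeline these would come from doc.ents after running nlp(text).
KEPT_LABELS = {"PERSON", "ORG", "GPE", "EVENT"}  # illustrative subset

def extract_keywords(entities: list[tuple[str, str]]) -> Counter:
    """Normalize named entities into keyword frequency counts."""
    counts: Counter = Counter()
    for text, label in entities:
        if label not in KEPT_LABELS:
            continue  # drop entity types we don't treat as keywords
        keyword = " ".join(text.split()).lower()  # collapse whitespace
        counts[keyword] += 1
    return counts

ents = [("Joe Biden", "PERSON"), ("joe  biden", "PERSON"),
        ("London", "GPE"), ("Tuesday", "DATE")]
counts = extract_keywords(ents)  # 'joe biden': 2, 'london': 1
```

Normalizing case and whitespace before counting is what lets differently formatted mentions of the same entity aggregate into one keyword.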

Server

A web server which serves the results of the processor. It aggregates word statistics, precomputes and caches aggregations, and serves them to the client. It also serves the frontend files.
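
A minimal illustration of the precompute-and-cache idea, with a toy in-memory stand-in for the wordstore (names and shapes are hypothetical, not the server's actual API):

```python
from functools import lru_cache

# Toy stand-in for the wordstore: keyword -> per-day frequencies.
WORDSTORE = {
    "election": {"2024-01-01": 10, "2024-01-02": 14},
    "economy":  {"2024-01-01": 3,  "2024-01-02": 7},
}

@lru_cache(maxsize=128)
def total_frequency(keyword: str) -> int:
    """Aggregate a keyword's frequency across all days; the result is
    cached, so repeated client requests skip the aggregation."""
    return sum(WORDSTORE.get(keyword, {}).values())

total = total_frequency("election")  # aggregated once, then cached
```

In a real deployment the cache would more likely live in Redis (listed among the technologies above) so that it survives restarts and is shared across server instances.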

Metastore

A postgres database which stores metadata of the scraped pages.

S3

An S3 provider which stores the static pages. It could be a local MinIO deployment or an external S3 cloud service.

Wordstore

A postgres database which stores the processed pages, keywords, topics, and their corresponding frequencies.
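
The shape of that data can be sketched with an in-memory SQLite stand-in. The table and column names here are assumptions; the actual Postgres schema may differ:

```python
import sqlite3

# Illustrative schema only -- the real Postgres schema may differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page    (id INTEGER PRIMARY KEY, url TEXT);
CREATE TABLE keyword (page_id INTEGER REFERENCES page(id),
                      keyword TEXT, frequency INTEGER);
""")
conn.execute("INSERT INTO page VALUES (1, 'https://example.com/news/1')")
conn.executemany("INSERT INTO keyword VALUES (1, ?, ?)",
                 [("election", 4), ("economy", 2)])

# The server-side aggregations reduce to queries over this shape:
total = conn.execute(
    "SELECT SUM(frequency) FROM keyword WHERE page_id = 1").fetchone()[0]
```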

Message queue

A NATS message queue used for service-to-service (S2S) communication.

Deploy

The JavaScript visualization is available at https://debias.dartt0n.ru/

Using external S3 provider

  1. Create a .env file and fill in the following variables:
PG_USERNAME=...
PG_PASSWORD=...
  2. Create the configuration files:
  • debias/scraper/config.toml
  • debias/server/config.toml
  • debias/processor/config.toml
  • debias/renderer/config.toml
  3. Pre-download the ML models:
mkdir models
uv run --group processor download-models.py
  4. Run the services:
docker compose -f docker-compose.yml up --build --detach

Using local S3 provider

  1. Create a .env file and fill in the following variables:
MINIO_ACCESS_KEY=...
MINIO_SECRET_KEY=...
MINIO_BUCKET=...

PG_USERNAME=...
PG_PASSWORD=...
  2. Create the configuration files:
  • debias/scraper/config.toml
  • debias/server/config.toml
  • debias/processor/config.toml
  • debias/renderer/config.toml
  3. Pre-download the ML models:
mkdir models
uv run --group processor download-models.py
  4. Create the MinIO S3 service using docker:
docker compose -f minio.docker-compose.yml up minio_setup

Scale services for better performance!

The following services can be scaled horizontally for better performance:

  • scraper
  • renderer
  • processor

For easy scaling, use the docker compose --scale option.

E.g., the following command will launch 5 scraper instances, 2 renderer instances, and 2 processor instances:

docker compose up --detach \
  --scale scraper=5 \
  --scale renderer=2 \
  --scale processor=2

Remove services

To stop and remove all containers AND THEIR VOLUMES:

docker compose -f minio.docker-compose.yml down --volumes
# or
docker compose -f docker-compose.yml down --volumes

Development

Structure

.
├── debias          # shared code root
│   ├── core        # reusable components - s3, metastore, configs, etc
│   ├── scraper     # scraper related code
│   ├── processor   # NLP processor related code
│   ├── renderer    # browser renderer related code
│   └── server      # server related code
│       └── frontend   # frontend related code

Adding new service

To add a new service:

  1. Create a new directory in the debias directory
  2. Create a dockerfile prefixed with the service name (e.g. scraper.dockerfile)
  3. Add all the required dependencies to pyproject.toml under a group named after the service
  4. Add the new package to the tool.hatch.build.targets.wheel config in pyproject.toml
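
For example, steps 3 and 4 might look like the following pyproject.toml fragment for a hypothetical `archiver` service (the group name, dependency, and package path are illustrative only):

```toml
# Illustrative fragment -- names are hypothetical.
[dependency-groups]
archiver = ["httpx"]

[tool.hatch.build.targets.wheel]
packages = ["debias/core", "debias/archiver"]
```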

Frontend Development

  1. Create a .env file and fill in the following variables:
PG_USERNAME=...
PG_PASSWORD=...
  2. Launch the database container using docker compose:
docker compose up -d database
  3. Generate random data. Set the environment variable POSTGRES_CONNECTION to the connection string of the database (replace USERNAME and PASSWORD with your actual username and password):
POSTGRES_CONNECTION="postgresql://USERNAME:PASSWORD@localhost:5432/postgres" uv run generate-data.py
  4. Create the server configuration file config.toml. The ${PG_USERNAME} and ${PG_PASSWORD} placeholders are filled from the .env variables:
[pg]
connection = "postgresql://${PG_USERNAME}:${PG_PASSWORD}@localhost:5432/postgres"
  5. Launch the backend server with hot reload:
CONFIG=config.toml uv run litestar --app debias.server:app run --debug --reload

EDA

We have collected 38 news sources from the USA and UK and determined their political positions.

Distribution of political positions overall

Distribution of political positions in the USA

Distribution of political positions in the UK

Bonus: Distribution of political positions of sources which require VPN

It seems left parties are indeed more liberal.

We have parsed several news articles using Python and prepared a deployment describing general trends in these articles.

The deployment can be found on GitHub Pages.

Visualization

The visualization is divided into three parts:

  1. Comparison of topic distributions for Left-Leaning and Right-Leaning media.
  2. Comparison of keyword networks for Left-Leaning and Right-Leaning media.
  3. Sandbox network with filtering functionality.

All visualizations are created using D3.js.

You can view the visualization at https://debias.dartt0n.ru/

Visualization example
