
Web scraper for extracting data from online newspapers


hamidurrk/epaper-scraper


E-paper Scraper

This is a Python-based web scraper for extracting data from online newspapers. It has been tested on the Jugantor and Prothom Alo newspapers, with editions from 2012 to 2024, on a Windows 11 machine.

Installation

Prerequisites

Before installing the scraper, ensure you have the following prerequisites:

  • A Windows machine (Tested on Windows 11)
  • Python 3.12.0 installed on your system
  • Anaconda for installing some packages
  • NVIDIA CUDA Toolkit 12.4 for GPU accelerated processes
  • Firefox installed as your browser. Make sure to install it in C:\Program Files\Mozilla Firefox\
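Some of these prerequisites can be checked from Python before going further. The following is a minimal sketch, not part of the repository: the function name `check_prerequisites` is hypothetical, and it only checks the Python version and the Firefox install directory named in the list above.

```python
import sys
from pathlib import Path

# Values taken from the prerequisites list above.
EXPECTED_PYTHON = (3, 12)
FIREFOX_DIR = Path(r"C:\Program Files\Mozilla Firefox")


def check_prerequisites(firefox_dir=FIREFOX_DIR, python_version=None):
    """Return a list of readable problems; an empty list means both checks passed."""
    if python_version is None:
        python_version = tuple(sys.version_info[:2])
    problems = []
    if tuple(python_version) != EXPECTED_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found; "
            f"the scraper was tested with {EXPECTED_PYTHON[0]}.{EXPECTED_PYTHON[1]}"
        )
    if not Path(firefox_dir).is_dir():
        problems.append(f"Firefox directory not found: {firefox_dir}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

A different Python minor version is reported as a warning rather than an error, since the README only states which version was tested.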

Installation Steps

CUDA Toolkits

For now, only the NVIDIA GPUs listed on this page are supported. If your graphics card has CUDA cores, you can continue with the setup. If not, contact the developer.

  1. Make sure that your NVIDIA drivers are up to date.

  2. Add Anaconda to your environment variables and run the following commands in the command prompt.

conda install numba
conda install cudatoolkit

NOTE: If Anaconda has not been added to the environment variables, navigate to the Anaconda installation folder, locate the Scripts directory, and open the command prompt there.
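After the conda commands finish, one way to confirm that Numba can see a CUDA-capable GPU is the short check below. This is a sketch and not a script shipped with the repository. The helper name `cuda_status` and its return strings are made up for illustration. `numba.cuda.is_available()` is Numba's own availability check.

```python
import importlib.util


def cuda_status():
    """Return 'numba-missing', 'cuda-unavailable', or 'cuda-available'."""
    if importlib.util.find_spec("numba") is None:
        return "numba-missing"
    try:
        from numba import cuda

        return "cuda-available" if cuda.is_available() else "cuda-unavailable"
    except Exception:
        # Driver or toolkit problems can surface as import or runtime errors.
        return "cuda-unavailable"


print(cuda_status())
```

If the result is not `cuda-available`, recheck the driver and toolkit installation before running GPU-accelerated steps.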

Tesseract

  1. Download the Tesseract OCR executable from here.

  2. Install Tesseract OCR by following the installation instructions provided in the repository. Make sure to install it in C:\Program Files (x86)\Tesseract-OCR.

  3. Open a command prompt or Anaconda prompt.

  4. Navigate to the directory where you have cloned or downloaded the epaper-scraper repository.

  5. Create and activate a virtual environment (optional but recommended):

    python -m venv venv
    venv\Scripts\activate

  6. Install the required Python packages using pip:

    pip install -r requirements.txt

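  7. Test whether Tesseract OCR is installed correctly by opening a Python prompt and running:

    import pytesseract
    # Point pytesseract at the install location from step 2.
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
    print(pytesseract.get_tesseract_version())

    If a version number is printed without errors, Tesseract OCR is installed successfully. Importing pytesseract alone only confirms that the Python package is present, not that the Tesseract executable can be found.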

Usage

There are two ways to use this software: with the GUI and without the GUI.

To use the epaper-scraper With GUI, follow these steps:

  1. Run main.py from src. This launches a desktop application like the following one:

[Image: Epaper Scraper Interface]

  2. Navigate through the interface to use the supported capabilities of the software.

Note: The GUI lacks advanced features which are available in the "Without GUI" version. The interface is being constantly updated to implement these features.

To use the advanced features of epaper-scraper Without GUI, follow these steps:

  1. Double-click the start_firefox.bat file to run it. Alternatively, run the commands from cmd.txt. This will initialize a Firefox browser instance.

  2. Call functions and adjust parameters in the Python files in src, then run them. Example:

    python main.py

  3. The scraper will start extracting data from the specified newspaper website and save it to the specified output directory.
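When adjusting parameters for a multi-day run, it can help to plan the date range and output folders first. The sketch below is purely illustrative. `edition_dirs`, its arguments, and the folder layout are hypothetical and do not come from the repository's src files.

```python
from datetime import date, timedelta
from pathlib import Path


def edition_dirs(newspaper, start, end, output_root):
    """Yield (edition_date, directory) pairs for each day from start to end, inclusive."""
    current = start
    while current <= end:
        # Hypothetical layout: <output_root>/<newspaper>/<year>/<month-day>
        yield current, Path(output_root) / newspaper / f"{current:%Y}" / f"{current:%m-%d}"
        current += timedelta(days=1)


# Hypothetical parameters; adjust to the newspaper and dates you want to scrape.
for day, folder in edition_dirs("jugantor", date(2012, 1, 1), date(2012, 1, 3), "output"):
    print(day.isoformat(), folder.as_posix())
```

Keeping one folder per edition date makes it easier to resume an interrupted run without repeating dates that were already saved.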
