SoloSynth1/wordpress-scraper

wordpress-scraper

Description

Simple, easy-to-use scraper to scrape data from WordPress JSON API

Features

  • Stores crawled documents as MongoDB documents or JSON files
  • Automatically retries on errors
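The retry behaviour can be pictured with a small wrapper. This is a minimal sketch of retry with exponential backoff, not the project's own implementation: the name `with_retries`, the attempt count, and the delays are all hypothetical choices made for illustration.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() and retry on any exception, doubling the delay each time.

    Hypothetical helper: the scraper's real retry policy may differ.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Wait 1x, 2x, 4x, ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without actually waiting.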

Requirements

  • Python 3.7+

Installation

pip install -r requirements.txt

How to use

Basic

Run crawl.py with the site's URL:

python3 crawl.py https://your.website.here

This crawls the site using DefaultCrawlSession, which attempts to crawl all posts, categories, and tags from the site.
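For orientation, posts, categories, and tags are exposed by the standard WordPress REST API under `/wp-json/wp/v2/`, with pagination controlled by the `page` and `per_page` query parameters (`per_page` is capped at 100). The sketch below only builds those URLs. The helper name `collection_url` is hypothetical, and whether DefaultCrawlSession requests exactly these URLs is an assumption.

```python
from urllib.parse import urlencode, urljoin

# Standard WordPress REST API (v2) collection routes.
RESOURCES = ("posts", "categories", "tags")


def collection_url(site_url, resource, page=1, per_page=100):
    """Build the URL for one page of a WP v2 collection (hypothetical helper)."""
    base = site_url.rstrip("/") + "/"
    query = urlencode({"per_page": per_page, "page": page})
    return urljoin(base, f"wp-json/wp/v2/{resource}") + "?" + query


for resource in RESOURCES:
    print(collection_url("https://your.website.here", resource))
```

WordPress reports the total number of pages in the `X-WP-TotalPages` response header, so a crawler can request page 1 first and then iterate up to that value.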

The crawled JSON files will be stored in the directory ./data/<domain-name>.
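To read the results back, something like the following may work. It is a sketch under two assumptions that are not confirmed by the project: that `<domain-name>` is the host name of the URL passed to crawl.py, and that every output file ends in `.json`. Both function names are hypothetical.

```python
import json
from pathlib import Path
from urllib.parse import urlparse


def output_dir(site_url, root="./data"):
    """Guess the output directory for a site (assumes <domain-name> is the host)."""
    host = urlparse(site_url).hostname
    if host is None:
        raise ValueError(f"no host name in URL: {site_url!r}")
    return Path(root) / host


def load_documents(directory):
    """Load every *.json file under directory, keyed by its relative path."""
    directory = Path(directory)
    documents = {}
    for path in sorted(directory.rglob("*.json")):
        with path.open(encoding="utf-8") as handle:
            documents[str(path.relative_to(directory))] = json.load(handle)
    return documents
```

For example, `load_documents(output_dir("https://your.website.here"))` would return a dictionary of parsed files, or an empty dictionary if nothing has been crawled yet.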

In most cases this is sufficient for sites that:

  1. do not require signing in
  2. do not block their JSON API paths
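Both conditions can be checked before starting a crawl by requesting a single post from the standard `/wp-json/wp/v2/posts` route. The probe below is a minimal sketch using only the standard library. The function name `json_api_available` and the `User-Agent` value are hypothetical, and the check is not part of the scraper itself.

```python
import json
from urllib.error import URLError
from urllib.request import Request, urlopen


def json_api_available(site_url, opener=urlopen, timeout=10):
    """Return True if the site serves a JSON list of posts without signing in."""
    url = site_url.rstrip("/") + "/wp-json/wp/v2/posts?per_page=1"
    request = Request(url, headers={"User-Agent": "wp-api-probe"})
    try:
        with opener(request, timeout=timeout) as response:
            # A public, unblocked API answers 200 with a JSON array.
            return response.status == 200 and isinstance(json.load(response), list)
    except (URLError, ValueError, OSError):
        # HTTP errors (401, 403, 404), network failures, or a non-JSON body.
        return False
```

Calling `json_api_available("https://your.website.here")` returns `False` when the API requires authentication, is blocked, or responds with an HTML page instead of JSON.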

Advanced

For advanced usage and customization, see wpscraper/session.py for the actual crawling procedure, and write your own CrawlSession.

Upcoming Features

  • Rewrite/Refactor
  • MongoDB Connector
  • Async session
  • Authentication Module
  • Cloudflare circumvention
  • Configurable retry policies
  • Full WPv2 API resources support
