roniemartinez/dude
dude (uncomplicated data extraction): a simple framework for writing web scrapers using Python decorators
📋 What's Changed
- ⬆️ Bump mypy from 1.5.0 to 1.5.1 by @dependabot in https://github.com/roniemartinez/dude/pull/410
- ⬆️ Bump docker/setup-buildx-action from 2.9.1 to 2.10.0 by @dependabot in https://github.com/roniemartinez/dude/pull/414
- ⬆️ Bump pytest from 7.4.0 to 7.4.1 by @dependabot in https://github.com/roniemartinez/dude/pull/418
- ⬆️ Bump mkdocstrings from 0.22.0 to 0.23.0 by @dependabot in https://github.com/roniemartinez/dude/pull/419
- ⬆️ Bump mkdocs-material from 9.1.21 to 9.2.8 by @dependabot in https://github.com/roniemartinez/dude/pull/421
- ⬆️ Bump sigstore/cosign-installer from 3.1.1 to 3.1.2 by @dependabot in https://github.com/roniemartinez/dude/pull/420
- ⬆️ Bump actions/checkout from 3 to 4 by @dependabot in https://github.com/roniemartinez/dude/pull/422
- ⬆️ Bump docker/build-push-action from 4.1.1 to 4.2.0 by @dependabot in https://github.com/roniemartinez/dude/pull/424
- + 81 more
📋 What's Changed
- 🔧 Enable Poetry virtualenv by @roniemartinez in https://github.com/roniemartinez/dude/pull/391
📦 Dependencies
- ⬆️ Bump pyproject-flake8 from 3.9.2 to 5.0.4.post1 by @dependabot in https://github.com/roniemartinez/dude/pull/380
- ⬆️ Bump isort from 5.11.5 to 5.12.0 by @dependabot in https://github.com/roniemartinez/dude/pull/379
- Bump pytest from 7.3.2 to 7.4.0 by @dependabot in https://github.com/roniemartinez/dude/pull/383
- Bump mypy from 1.3.0 to 1.4.1 by @dependabot in https://github.com/roniemartinez/dude/pull/385
- Bump mkdocs-material from 9.1.15 to 9.1.17 by @dependabot in https://github.com/roniemartinez/dude/pull/386
- Bump sigstore/cosign-installer from 3.0.5 to 3.1.1 by @dependabot in https://github.com/roniemartinez/dude/pull/388
- Bump docker/setup-buildx-action from 2.7.0 to 2.8.0 by @dependabot in https://github.com/roniemartinez/dude/pull/389
- ⬆️ Update dependencies by @roniemartinez in https://github.com/roniemartinez/dude/pull/394
- + 15 more
📋 What's Changed
- 🔥 Drop Python 3.7 by @roniemartinez in https://github.com/roniemartinez/dude/pull/378
📦 Dependencies
- ⬆️ Bump mkdocs-material from 9.1.13 to 9.1.14 by @dependabot in https://github.com/roniemartinez/dude/pull/364
- ⬆️ Bump types-pyyaml from 6.0.12.9 to 6.0.12.10 by @dependabot in https://github.com/roniemartinez/dude/pull/365
- ⬆️ Bump pytest-cov from 4.0.0 to 4.1.0 by @dependabot in https://github.com/roniemartinez/dude/pull/366
- ⬆️ Bump mkdocstrings from 0.21.2 to 0.22.0 by @dependabot in https://github.com/roniemartinez/dude/pull/367
- ⬆️ Bump mkdocs-material from 9.1.14 to 9.1.15 by @dependabot in https://github.com/roniemartinez/dude/pull/368
- ⬆️ Bump docker/setup-buildx-action from 2.5.0 to 2.6.0 by @dependabot in https://github.com/roniemartinez/dude/pull/370
- ⬆️ Bump docker/metadata-action from 4.4.0 to 4.5.0 by @dependabot in https://github.com/roniemartinez/dude/pull/371
- ⬆️ Bump docker/login-action from 2.1.0 to 2.2.0 by @dependabot in https://github.com/roniemartinez/dude/pull/372
- + 6 more
📋 What's Changed
- 🔧 Sort optional dependencies by @roniemartinez in https://github.com/roniemartinez/dude/pull/363
- 📚 Fix badge by @roniemartinez in https://github.com/roniemartinez/dude/pull/349
📦 Dependencies
- ⬆️ Bump mkdocs-material from 9.1.3 to 9.1.4 by @dependabot in https://github.com/roniemartinez/dude/pull/326
- ⬆️ Bump types-pyyaml from 6.0.12.8 to 6.0.12.9 by @dependabot in https://github.com/roniemartinez/dude/pull/327
- ⬆️ Bump black from 23.1.0 to 23.3.0 by @dependabot in https://github.com/roniemartinez/dude/pull/329
- ⬆️ Bump mkdocs-material from 9.1.4 to 9.1.5 by @dependabot in https://github.com/roniemartinez/dude/pull/330
- ⬆️ Bump mkdocstrings from 0.20.0 to 0.21.2 by @dependabot in https://github.com/roniemartinez/dude/pull/334
- ⬆️ Bump beautifulsoup4 from 4.12.0 to 4.12.1 by @dependabot in https://github.com/roniemartinez/dude/pull/331
- ⬆️ Bump mypy from 1.1.1 to 1.2.0 by @dependabot in https://github.com/roniemartinez/dude/pull/333
- ⬆️ Bump pytest from 7.2.2 to 7.3.0 by @dependabot in https://github.com/roniemartinez/dude/pull/338
- + 23 more
📋 What's Changed
- ✨ Follow redirects by @roniemartinez in https://github.com/roniemartinez/dude/pull/325
📦 Dependencies
- ⬆️ Bump docker/setup-buildx-action from 2.4.1 to 2.5.0 by @dependabot in https://github.com/roniemartinez/dude/pull/322
- ⬆️ Bump mkdocs-material from 9.1.1 to 9.1.2 by @dependabot in https://github.com/roniemartinez/dude/pull/321
- ⬆️ Bump mkdocs-material from 9.1.2 to 9.1.3 by @dependabot in https://github.com/roniemartinez/dude/pull/323
- ⬆️ Bump beautifulsoup4 from 4.11.2 to 4.12.0 by @dependabot in https://github.com/roniemartinez/dude/pull/324
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.24.0...0.25.0
📋 What's Changed
- ✨ Support generator handlers by @roniemartinez in https://github.com/roniemartinez/dude/pull/320
📦 Dependencies
- ⬆️ Bump webdriver-manager from 3.8.4 to 3.8.5 by @dependabot in https://github.com/roniemartinez/dude/pull/292
- ⬆️ Bump pytest from 7.2.0 to 7.2.1 by @dependabot in https://github.com/roniemartinez/dude/pull/295
- ⬆️ Bump respx from 0.20.0 to 0.20.1 by @dependabot in https://github.com/roniemartinez/dude/pull/294
- ⬆️ Update dependencies by @roniemartinez in https://github.com/roniemartinez/dude/pull/296
- ⬆️ Bump mkdocs-material from 9.0.11 to 9.0.12 by @dependabot in https://github.com/roniemartinez/dude/pull/299
- ⬆️ Bump types-pyyaml from 6.0.12.4 to 6.0.12.5 by @dependabot in https://github.com/roniemartinez/dude/pull/298
- ⬆️ Bump types-pyyaml from 6.0.12.5 to 6.0.12.6 by @dependabot in https://github.com/roniemartinez/dude/pull/300
- ⬆️ Bump certifi from 2022.5.18.1 to 2022.12.7 by @dependabot in https://github.com/roniemartinez/dude/pull/303
- + 16 more
💥 Breaking change
- 🔥 Drop Pyppeteer support by @roniemartinez in https://github.com/roniemartinez/dude/pull/278
📦 Other
- 🔨 Reformat for mypy, reformat GA yml files by @roniemartinez in https://github.com/roniemartinez/dude/pull/291
- 💚 Add Github Actions to dependabot.yml by @roniemartinez in https://github.com/roniemartinez/dude/pull/264
- 🐛 Fix badge by @roniemartinez in https://github.com/roniemartinez/dude/pull/279
📦 Dependencies
- ⬆️ Bump pybrowsers from 0.4.1 to 0.5.0 by @dependabot in https://github.com/roniemartinez/dude/pull/188
- ⬆️ Bump pybrowsers from 0.5.0 to 0.5.1 by @dependabot in https://github.com/roniemartinez/dude/pull/191
- ⬆️ Bump selenium-wire from 4.6.4 to 4.6.5 by @dependabot in https://github.com/roniemartinez/dude/pull/189
- ⬆️ Bump webdriver-manager from 3.8.0 to 3.8.1 by @dependabot in https://github.com/roniemartinez/dude/pull/190
- ⬆️ Bump types-pyyaml from 6.0.9 to 6.0.10 by @dependabot in https://github.com/roniemartinez/dude/pull/192
- ⬆️ Bump playwright from 1.23.0 to 1.23.1 by @dependabot in https://github.com/roniemartinez/dude/pull/193
- ⬆️ Bump webdriver-manager from 3.8.1 to 3.8.2 by @dependabot in https://github.com/roniemartinez/dude/pull/194
- ⬆️ Bump mypy from 0.961 to 0.971 by @dependabot in https://github.com/roniemartinez/dude/pull/195
- + 59 more
📋 What's Changed
- ⬆️ Bump mkdocs-material from 8.3.3 to 8.3.4 by @dependabot in https://github.com/roniemartinez/dude/pull/175
- ⬆️ Bump mkdocs-material from 8.3.4 to 8.3.5 by @dependabot in https://github.com/roniemartinez/dude/pull/176
- ⬆️ Bump mkdocs-material from 8.3.5 to 8.3.6 by @dependabot in https://github.com/roniemartinez/dude/pull/177
- ⬆️ Bump mkdocs-material from 8.3.6 to 8.3.7 by @dependabot in https://github.com/roniemartinez/dude/pull/178
- ⬆️ Bump webdriver-manager from 3.7.0 to 3.7.1 by @dependabot in https://github.com/roniemartinez/dude/pull/181
- ⬆️ Bump mkdocs-material from 8.3.7 to 8.3.8 by @dependabot in https://github.com/roniemartinez/dude/pull/179
- ⬆️ Bump types-pyyaml from 6.0.8 to 6.0.9 by @dependabot in https://github.com/roniemartinez/dude/pull/180
- ⬆️ Bump black from 22.3.0 to 22.6.0 by @dependabot in https://github.com/roniemartinez/dude/pull/182
- + 6 more
📋 What's Changed
- 🐛 Fix memory leak by @roniemartinez in https://github.com/roniemartinez/dude/pull/174
- ⬆️ Bump mkdocs-material from 8.3.1 to 8.3.2 by @dependabot in https://github.com/roniemartinez/dude/pull/171
- ⬆️ Bump mypy from 0.960 to 0.961 by @dependabot in https://github.com/roniemartinez/dude/pull/172
- ⬆️ Bump mkdocs-material from 8.3.2 to 8.3.3 by @dependabot in https://github.com/roniemartinez/dude/pull/173
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.21.0...0.21.1
📋 What's Changed
- ✨ ChromeDriver version selection by @roniemartinez in https://github.com/roniemartinez/dude/pull/170
- 🐛 Fix mkdocstrings by @roniemartinez in https://github.com/roniemartinez/dude/pull/167
- ⬆️ Bump mkdocs-material from 8.2.16 to 8.3.0 by @dependabot in https://github.com/roniemartinez/dude/pull/168
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.20.3...0.21.0
📋 What's Changed
- ⬆️ Bump braveblock from 0.2.0 to 0.3.0 by @dependabot in https://github.com/roniemartinez/dude/pull/159
- ⬆️ Bump mypy from 0.950 to 0.960 by @dependabot in https://github.com/roniemartinez/dude/pull/161
- ⬆️ Bump mkdocs-material from 8.2.15 to 8.2.16 by @dependabot in https://github.com/roniemartinez/dude/pull/164
- ⬆️ Bump mkdocstrings from 0.18.1 to 0.19.0 by @dependabot in https://github.com/roniemartinez/dude/pull/163
- ⬆️ Bump lxml from 4.8.0 to 4.9.0 by @dependabot in https://github.com/roniemartinez/dude/pull/165
- ⬆ Update dependencies by @roniemartinez in https://github.com/roniemartinez/dude/pull/166
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.20.2...0.20.3
📋 What's Changed
- 🐛 Fix helper imports by @roniemartinez in https://github.com/roniemartinez/dude/pull/158
- ⬆️ Bump httpx from 0.22.0 to 0.23.0 by @dependabot in https://github.com/roniemartinez/dude/pull/156
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.20.1...0.20.2
📋 What's Changed
- 💚 Set latest tag in docker/build-push-action by @roniemartinez in https://github.com/roniemartinez/dude/pull/153
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.20.0...0.20.1
📋 What's Changed
- 🐳 Docker image by @roniemartinez in https://github.com/roniemartinez/dude/pull/152
📦 Other
- ⬆️ Bump mkdocs-material from 8.2.13 to 8.2.14 by @dependabot in https://github.com/roniemartinez/dude/pull/148
- ⬆️ Bump selenium-wire from 4.6.3 to 4.6.4 by @dependabot in https://github.com/roniemartinez/dude/pull/149
- ⬆️ Bump playwright from 1.21.0 to 1.22.0 by @dependabot in https://github.com/roniemartinez/dude/pull/150
- ⬆️ Bump mkdocs-material from 8.2.14 to 8.2.15 by @dependabot in https://github.com/roniemartinez/dude/pull/151
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.19.0...0.20.0
📋 What's Changed
- ✨ Follow dynamically-built URLs by @roniemartinez in https://github.com/roniemartinez/dude/pull/146
- 🔨 Add ignore robots.txt warning by @roniemartinez in https://github.com/roniemartinez/dude/pull/145
📦 Dependencies
- ⬆️ Bump beautifulsoup4 from 4.11.0 to 4.11.1 by @dependabot in https://github.com/roniemartinez/dude/pull/136
- ⬆️ Bump mkdocs-material from 8.2.8 to 8.2.9 by @dependabot in https://github.com/roniemartinez/dude/pull/135
- ⬆️ Bump pyproject-flake8 from 0.0.1a3 to 0.0.1a4 by @dependabot in https://github.com/roniemartinez/dude/pull/137
- ⬆️ Bump playwright from 1.20.1 to 1.21.0 by @dependabot in https://github.com/roniemartinez/dude/pull/138
- ⬆️ Bump types-pyyaml from 6.0.5 to 6.0.6 by @dependabot in https://github.com/roniemartinez/dude/pull/139
- ⬆️ Bump types-pyyaml from 6.0.6 to 6.0.7 by @dependabot in https://github.com/roniemartinez/dude/pull/140
- ⬆️ Bump pytest from 7.1.1 to 7.1.2 by @dependabot in https://github.com/roniemartinez/dude/pull/141
- ⬆️ Bump mkdocs-material from 8.2.9 to 8.2.11 by @dependabot in https://github.com/roniemartinez/dude/pull/142
- + 4 more
📋 What's Changed
- ✨ Follow robots.txt rules with option to ignore by @roniemartinez in https://github.com/roniemartinez/dude/pull/134
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.17.0...0.18.0
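As a rough illustration of what following robots.txt rules means in practice, the sketch below uses the standard library `urllib.robotparser`. It is not dude's implementation: the robots.txt content, the `is_allowed` helper, and its `ignore_robots_txt` parameter are hypothetical names chosen for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, user_agent: str = "*", ignore_robots_txt: bool = False) -> bool:
    """Return True if the URL may be fetched (hypothetical helper, not dude's API)."""
    if ignore_robots_txt:
        # The "option to ignore": skip the rules entirely.
        return True
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/public/page"))
print(is_allowed("https://example.com/private/page"))
print(is_allowed("https://example.com/private/page", ignore_robots_txt=True))
```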
📋 What's Changed
- ✨ Rename url to url_match and support function/lambda as matcher by @roniemartinez in https://github.com/roniemartinez/dude/pull/131
- ⬆️ Bump beautifulsoup4 from 4.10.0 to 4.11.0 by @dependabot in https://github.com/roniemartinez/dude/pull/132
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.16.0...0.17.0
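One way to picture a matcher that accepts either a pattern or a function is the standalone sketch below. The `url_matches` helper and the `UrlMatcher` alias are hypothetical and only mirror the idea behind `url_match`; they are not taken from dude's source, and the string branch assumes wildcard patterns as handled by the standard library `fnmatch`.

```python
from fnmatch import fnmatch
from typing import Callable, Union

# Hypothetical type alias: either a wildcard pattern or a predicate on the URL.
UrlMatcher = Union[str, Callable[[str], bool]]


def url_matches(url: str, url_match: UrlMatcher) -> bool:
    """Check a URL against a pattern string or a function/lambda (illustrative only)."""
    if callable(url_match):
        # Function or lambda: delegate the decision to the caller's code.
        return bool(url_match(url))
    # String: treat it as a Unix shell-style wildcard pattern.
    return fnmatch(url, url_match)


print(url_matches("https://example.com/index.html", lambda u: u.endswith(".html")))
print(url_matches("https://example.com/index.html", "*.com/*"))
print(url_matches("https://example.org/index.html", "*.com/*"))
```

A callable matcher is useful when the condition cannot be expressed as a single pattern, for example when it depends on parsed query parameters.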
📋 What's Changed
- ✨ Support custom HTTP methods by @roniemartinez in https://github.com/roniemartinez/dude/pull/130
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.15.2...0.16.0
📋 What's Changed
- 🐛 Fix HTTPX async event hook by @roniemartinez in https://github.com/roniemartinez/dude/pull/129
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.15.1...0.15.2
📋 What's Changed
- ⬆️ Fix dependency gridlock by @roniemartinez in https://github.com/roniemartinez/dude/pull/128
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.15.0...0.15.1
📋 What's Changed
- 🔨 Run adblock on HTTPX request event hook by @roniemartinez in https://github.com/roniemartinez/dude/pull/126
- docs: add roniemartinez as a contributor for maintenance, code, doc, infra by @allcontributors in https://github.com/roniemartinez/dude/pull/125
✨ New Contributors
- @allcontributors made their first contribution in https://github.com/roniemartinez/dude/pull/125
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.14.0...0.15.0
📋 What's Changed
- ✨ Use [fnmatch](https://docs.python.org/3/library/fnmatch.html) by @roniemartinez in https://github.com/roniemartinez/dude/pull/122
📦 Other
- ⬆️ Bump pyproject-flake8 from 0.0.1a2 to 0.0.1a3 by @dependabot in https://github.com/roniemartinez/dude/pull/120
- ⬆️ Bump black from 22.1.0 to 22.3.0 by @dependabot in https://github.com/roniemartinez/dude/pull/121
📦 fnmatch: URL pattern matcher now uses Unix-style wildcards (fnmatch) instead of regular expressions
- See: https://docs.python.org/3/library/fnmatch.html
- Wildcards are easier to understand and simpler to use compared to regular expressions
```diff
- @select(css=".title", url=r".*\.com")
+ @select(css=".title", url="*.com/*")
  def result_title(element):
      return {"title": element.text_content()}
```
- + 1 more
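To see how the two pattern styles compare, here is a small sketch that uses only the standard library `re` and `fnmatch` modules. The sample URLs are made up, and the code shows the matching semantics only, not how dude applies patterns internally.

```python
import re
from fnmatch import fnmatch

# Hypothetical URLs, used only for illustration.
urls = [
    "https://example.com/articles/1",
    "https://example.org/articles/1",
]

# Old style: a regular expression, where the dot must be escaped.
regex = re.compile(r".*\.com")
regex_hits = [u for u in urls if regex.match(u)]

# New style: a Unix shell-style wildcard, where "*" matches any run of characters.
wildcard_hits = [u for u in urls if fnmatch(u, "*.com/*")]

print(regex_hits)
print(wildcard_hits)
```

Both lists contain only the `example.com` URL; the wildcard form avoids escaping and reads closer to the URL itself.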
📋 What's Changed
- ✨ Make return value of decorated functions optional by @roniemartinez in https://github.com/roniemartinez/dude/pull/119
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.2...0.13.0
📋 What's Changed
- 🐛 Fix PlaywrightScraper overwriting output file by @roniemartinez in https://github.com/roniemartinez/dude/pull/118
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.1...0.12.2
📋 What's Changed
- 🔨 Refactor for Alpha by @roniemartinez in https://github.com/roniemartinez/dude/pull/112
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.12.0...0.12.1
📋 What's Changed
- ✨ Add shutdown event and save per page option by @roniemartinez in https://github.com/roniemartinez/dude/pull/102
📦 Other
- ⬆️ Bump playwright from 1.20.0 to 1.20.1 by @dependabot in https://github.com/roniemartinez/dude/pull/101
- ⬆️ Bump mypy from 0.941 to 0.942 by @dependabot in https://github.com/roniemartinez/dude/pull/104
- ⬆️ Bump mkdocs-material from 8.2.6 to 8.2.7 by @dependabot in https://github.com/roniemartinez/dude/pull/105
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.11.0...0.12.0
- You can now save data after each page is scraped. Pass `is_per_page=True` to the `@save` decorator and run the scraper with `--save-per-page` to enable it.
```python
@save("jsonl", is_per_page=True)
def save_jsonl(data, output) -> bool:
```
- + 12 more
✨ Features
- ✨ Events by @roniemartinez in https://github.com/roniemartinez/dude/pull/99
- 🔗 Follow URLs by @roniemartinez in https://github.com/roniemartinez/dude/pull/90
📝 Documentation
- 📚 Update docs by @roniemartinez in https://github.com/roniemartinez/dude/pull/93
🐛 Fixes
- 💚 Fix Actions rate limit error by @roniemartinez in https://github.com/roniemartinez/dude/pull/81
- 🐛 Fix DevToolsActivePort file doesn't exist by @roniemartinez in https://github.com/roniemartinez/dude/pull/84
- 🐛 Fix selenium failing on Windows by @roniemartinez in https://github.com/roniemartinez/dude/pull/94
📦 Other
- ⬆️ Bump selenium-wire from 4.6.2 to 4.6.3 by @dependabot in https://github.com/roniemartinez/dude/pull/80
- ⬆️ Bump mypy from 0.931 to 0.941 by @dependabot in https://github.com/roniemartinez/dude/pull/82
- ⬆️ Bump pytest from 7.0.1 to 7.1.0 by @dependabot in https://github.com/roniemartinez/dude/pull/78
- ⬆️ Bump braveblock from 0.1.13 to 0.2.0 by @dependabot in https://github.com/roniemartinez/dude/pull/83
- ⬆️ Bump playwright from 1.19.1 to 1.20.0 by @dependabot in https://github.com/roniemartinez/dude/pull/87
- ⬆️ Bump types-pyyaml from 6.0.4 to 6.0.5 by @dependabot in https://github.com/roniemartinez/dude/pull/88
- ⬆️ Bump pytest from 7.1.0 to 7.1.1 by @dependabot in https://github.com/roniemartinez/dude/pull/91
- ⬆️ Bump webdriver-manager from 3.5.3 to 3.5.4 by @dependabot in https://github.com/roniemartinez/dude/pull/97
- + 1 more
📦 Example
```console
dude scrape ... --follow-urls
```
or
```python
if __name__ == "__main__":
    import dude

    dude.run(..., follow_urls=True)
```
- + 2 more
📦 Example
```python
import uuid
from pathlib import Path

from dude import post_setup, pre_setup, startup

SAVE_DIR: Path


@startup()
def initialize_csv():
    """
```
- + 24 more
📦 Diagram showing when events are executed
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.1...0.11.0
📋 What's Changed
- 🏁 Fix Windows support by @roniemartinez in https://github.com/roniemartinez/dude/pull/76
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.10.0...0.10.1
✨ Added
- ✨ Block ads by @roniemartinez in https://github.com/roniemartinez/dude/pull/74
📋 Changed
- 🔨 Refactor and update docs by @roniemartinez in https://github.com/roniemartinez/dude/pull/75
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.2...0.10.0
📋 What's Changed
- 🔧 Disable notifications by @roniemartinez in https://github.com/roniemartinez/dude/pull/73
- Full Changelog: https://github.com/roniemartinez/dude/compare/0.9.1...0.9.2
