andreskrey/readability.php
PHP port of Mozilla's Readability.js
📋 Changes
- Avoid overwriting extracted metadata with similarly named keys (like `og:image` and `og:image:width`)
- Imported new `getSiteName()` feature from JS version as of [21 Dec 2018](https://github.com/mozilla/readability/pull/504)
- Added getFirstElementChild function to NodeTrait + test case (Issue #83)
- Reworked the test suite to use TestPage objects and give more hints about what failed
- Removed getWordThreshold and setWordThreshold configuration functions
- Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
- Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows adding new nodes to the list
- Added new Dockerfiles that pull different versions of PHP and libxml. Now we are supporting 4 versions of PHP and 6 versions of libxml!
Oopsie. Noticed that the main image was always missing from the results? That's because I screwed it up. But fear not, it's fixed. I also updated the tests to be a little more strict so this, IN THEORY, should not happen again.
📋 Changes
- Move phrasing contents into paragraphs
- Improved the title detection
- Remove single cell tables
- Improved the detection of video related elements
- New test cases
- Various minor fixes
- Clean <aside> tags during prepArticle().
- Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
📦 IT'S GONE
- So make sure you run this code in a somewhat modern version of PHP. A version that starts with 7.
- That's it. Take care. Call your mother.
📋 Changes
- Merged PR #49 (Missing object when calling `->getContent()`)
- Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
  - Check for `<base>` elements before converting URLs to absolute.
  - Clean `<link>` tags on `prepArticle()`
  - Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
  - Add new test cases for the previous changes
  - And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)
📋 Changes
- Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
- Added a safety check to avoid sending the DOMDocument as a node when scanning for node ancestors.
- Fix issue #45: Small mistake in documentation
- Fix issue #46: Added `data-src` as an image source path
- Fixed bug when extracting all the images of the article (it was extracting images from the original DOM instead of the parsed one)
- Added the `->getDOMDocument()` getter to retrieve the fully parsed DOMDocument
- Merged PR #48 that allows passing an array as configuration (@topotru)
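The move from `assertEquals` to `assertSame` matters because PHP's loose comparison converts types before comparing. As a rough illustration only (plain `==` and `===` stand in for the two assertions on scalar values; this is not the project's test code):

```php
<?php
// Loose comparison juggles types, so values of different types can compare equal.
$looseStringInt  = ('1' == 1);     // true: the string is converted to an int
$looseNullEmpty  = (null == '');   // true: null is treated as an empty string

// Strict comparison also requires the types to match.
$strictStringInt = ('1' === 1);    // false: string vs int
$strictNullEmpty = (null === '');  // false: null vs string

var_dump($looseStringInt, $looseNullEmpty, $strictStringInt, $strictNullEmpty);
```

With strict checks, a getter that returns `null` where a string is expected fails the test instead of slipping through.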
📋 Changes
- Added 'data-orig' as a URL source for images
- Removed 'modal' as a negative property from classes
- Added option to inject a logger
- Removed all references to the `data-readability` tags that don't apply anymore to the new structure
- Merged PR #38 (Missing DOMEntityReference)
Hi all! Finally v1 is here. 🎉🎉🎉 The project changed drastically from v0, mainly because the HTMLParser is gone and the Readability class replaces it. I know, confusing, but this change aligns us with Readability.js and makes everything easier to port.

Another huge change, one I had wanted to make since version 0.0.1, was getting rid of the node encapsulation. v0 used the league\html-to-markdown NodeElement class to encapsulate the nodes and act as a middleman between your code and the DOMDocument. This caused lots of trouble: when you encapsulate nodes, you sever the relation between the original DOM and the encapsulated node, which forces you to keep track of the changes between them instead of letting the system do it. Instead of encapsulating nodes, this version extends the original class, which solves all these issues.

Check the readme file to understand how to port your v0 code to v1, and the changelog to read about all the other changes. Enjoy!
📋 Changes
- Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
- Fix issue when converting divs to p nodes and never rating them (issue #29)
- Fix "Unsupported operand types" (PR #31)
- Fix division by zero when no title was found (issue #32)
- New function to retrieve all images at once (PR #30)
- Get the title from the `<title>` tag before searching on the `<meta>` tags
📋 Changes
- Merged PR #24. Fixes notice when trying to extract `og:image`
- Up to date with commit [c3ff1a2](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5) (2017-10-16), which includes the following changes:
  - New tags added to the unlikelyCandidates regex
  - Detection and removal of hierarchical separators in titles
  - Added more tags to clean after parsing the article (`button`, `textarea`, `select`, etc.)
  - New way to detect empty nodes (including an edge case where a node with a `&nbsp;` was detected as a node with content)
  - Better approach to find a top candidate (especially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
  - Detect text direction (`ltr` or `rtl`)
📋 Changes
- Added a safety check for really nasty HTML
- Added summonCthulhu option, to remove all script tags via regex
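For readers wondering what removing script tags via regex can look like, here is a minimal sketch. The helper name `stripScriptTags` and the pattern are assumptions made for illustration; they are not taken from the library's source.

```php
<?php
// Hypothetical helper: the name and the pattern are illustrative only.
function stripScriptTags(string $html): string
{
    // 'i' makes the match case-insensitive; 's' lets '.' span newlines,
    // so multi-line script bodies are removed as well.
    $result = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

    // preg_replace() returns null on failure (e.g. backtrack limit hit);
    // in that case fall back to the untouched input.
    return $result ?? $html;
}

$html = '<p>Hello</p><script type="text/javascript">alert("x");</script><p>World</p>';
echo stripScriptTags($html), PHP_EOL; // <p>Hello</p><p>World</p>
```

Parsing HTML with a regex is fragile in general, which is presumably why this is an opt-in option for pages the DOM parser cannot handle cleanly.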
📋 Changes
- Added `normalizeEntities` flag to convert UTF-8 characters to their HTML entity equivalents. Fixes bugs on HTML documents with mixed encodings.
- Added more information to the readme.md file
- New way to create a backup DOM: not creating a backup. In the previous version, the system cloned the $this->dom object to keep it as a backup in order to restart the algorithm with other flags, if needed. This seemed to work until I realized that *sometimes* the backup changes even if we are not touching it. It seems that the `dom` and `backupdom` objects are linked and *some* changes on the dom object reach the backupdom object. The new approach consists of deleting the backupdom object and recreating the dom object from scratch. Of course this has a performance impact, but it seems to be quite low.
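The "recreate from scratch" idea can be sketched with the standard DOM extension alone. The helper `loadDom` below is hypothetical and only shows the general pattern of re-parsing the original HTML string instead of keeping a clone; it is not the library's implementation.

```php
<?php
// Hypothetical helper: build a fresh DOMDocument from the original HTML string.
function loadDom(string $html): DOMDocument
{
    $dom = new DOMDocument('1.0', 'utf-8');

    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$html = '<html><body><div>Sidebar</div><p>Article text</p></body></html>';

// First pass: the algorithm mutates the tree (here, it drops the <div>).
$dom = loadDom($html);
$div = $dom->getElementsByTagName('div')->item(0);
$div->parentNode->removeChild($div);
$divsAfterFirstPass = $dom->getElementsByTagName('div')->length; // 0

// Retry with other flags: re-parse the source string, so no state leaks over.
$dom = loadDom($html);
$divsAfterReload = $dom->getElementsByTagName('div')->length; // 1

var_dump($divsAfterFirstPass, $divsAfterReload);
```

The trade-off is the one described above: every retry pays for a full parse, but the new tree cannot share state with the old one.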
📋 Changes
- Every unit test passes
- Readability.php produces exactly the same output as Readability.js
- I'm happy :)
🐛 Fixed
- Lots of bugs
- Merged PR by DavidFricker to avoid exceptions while grabbing the document content
✨ Added
- substituteEntities flag, to avoid replacing special characters with HTML entities. There's nothing we can do about `&nbsp;`: that entity is replaced by libxml and there's no way to disable it.
- Named data sets so it's easier to detect which test case is failing.
🗑️ Removed
- Couple of test cases that involved broken JS. There's nothing we can do about JS spilling onto the text.
📋 Changes
- New way to get the metadata of the article.
📋 Changes
- Small fix to clean style tags after creating the final article
Happy Holidays! I've finally managed to port 100% of the code and make most of the test cases pass! There's a lot of work to do, but the current release behaves mostly like the original JS project. Enjoy!
📋 Changes
- Added prepArticle to remove junk after selecting the top candidates.
- Added a function to restore the score after selecting top candidates. This basically works by scanning the data-readability tag and restoring the score to the contentScore variable. This is a horrible hack and should be removed once we ditch the Element interface of html-to-markdown and start extending the DOMDocument object.
- Switched all strlen functions to mb_strlen
- Fixed lots of bugs and pretty sure I introduced a bunch of new ones.
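The `strlen` to `mb_strlen` switch matters for any length-based scoring of non-ASCII text, because `strlen` counts bytes rather than characters. A small sketch, assuming the mbstring extension is loaded:

```php
<?php
// "café" written with a Unicode escape so the example does not depend
// on the encoding of this source file.
$text = "caf\u{00E9}";

$byteLength = strlen($text);              // 5: "é" takes two bytes in UTF-8
$charLength = mb_strlen($text, 'UTF-8');  // 4: counts characters, not bytes

var_dump($byteLength, $charLength);
```

For articles in languages such as Japanese or Russian the gap is much larger, since most characters there take two or three bytes each.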
All the current development will be done in the develop branch.
Pre release of the first version. Lots to do, lots to fix. But it's a nice start!
