andreskrey/readability.php

PHP port of Mozilla's Readability.js

18 Releases

Latest: 6y ago

v2.1.0 > The one where I realized that libxml didn't die on version v2.9.4 (Latest)

andreskrey·6y ago·July 22, 2019

📋 Changes

Avoid overwriting extracted metadata with similarly named keys (like `og:image` and `og:image:width`)
Imported new `getSiteName()` feature from JS version as of [21 Dec 2018](https://github.com/mozilla/readability/pull/504)
Added getFirstElementChild function to NodeTrait + test case (Issue #83)
Reworked the test suite to use TestPage objects and give more hints about what failed
Removed getWordThreshold and setWordThreshold configuration functions
Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows adding new nodes to the list
Added new Dockerfiles that pull different versions of PHP and libxml. Now we are supporting 4 versions of PHP and 6 versions of libxml!

v2.0.1 > Oopsie

andreskrey·7y ago·November 27, 2018

Oopsie. Noticed that the main image was always missing from the results? That's because I screwed it up. But fear not, it's fixed. I also updated the tests to be a little more strict so this, IN THEORY, should not happen again.

v2.0.0 > Up to date with Readability.js again + docker containers

andreskrey·7y ago·November 25, 2018

📋 Changes

Move phrasing contents into paragraphs
Improved the title detection
Remove single cell tables
Improved the detection of video related elements
New test cases
Various minor fixes
Clean <aside> tags during prepArticle().
Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
+ 5 more

📦 IT'S GONE

So make sure you run this code in a somewhat modern version of PHP. A version that starts with 7.
That's it. Take care. Call your mother.

v1.2.0 > Up to date with Readability.js

andreskrey·8y ago·March 19, 2018

📋 Changes

Merged PR #49 (Missing object when calling `->getContent()`)
Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
Check for `<base>` elements before converting URLs to absolute.
Clean `<link>` tags on `prepArticle()`
Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
Add new test cases for the previous changes
And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)

v1.1.1 > The one with small changes

andreskrey·8y ago·March 12, 2018

📋 Changes

Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
Added a safe check to avoid sending the DOMDocument as a node when scanning for node ancestors.
Fix issue #45: Small mistake in documentation
Fix issue #46: Added `data-src` as an image source path
Fixed bug when extracting all the images of the article (it was extracting images from the original DOM instead of the parsed one)
Added the `->getDOMDocument()` getter to retrieve the fully parsed DOMDocument
Merged PR #48 that allows passing an array as configuration (@topotru)
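
The switch from assertEquals to assertSame is about loose versus strict comparison. The snippet below is only a plain-PHP sketch of that difference, not code from the test suite: roughly speaking, assertEquals accepts values that are equal after type juggling, while assertSame also requires the types to match.

```php
<?php

// Loose comparison (==) juggles types before comparing.
$looseStringInt = ('1' == 1);        // true: the numeric string is converted
$looseNullFalse = (null == false);   // true: both are treated as "falsy"

// Strict comparison (===) also requires the types to be identical.
$strictStringInt = ('1' === 1);      // false: string vs int
$strictNullFalse = (null === false); // false: null vs bool

var_dump($looseStringInt, $strictStringInt, $looseNullFalse, $strictNullFalse);
```

A test that expects `null` but receives `false` would slip through a loose check and fail a strict one, which is the kind of mistake the stricter assertion is meant to catch.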

v1.1.0 > Say hello to optional logging

andreskrey·8y ago·January 11, 2018

📋 Changes

Added 'data-orig' as a URL source for images
Removed 'modal' as a negative property from classes
Added option to inject a logger
Removed all references to the `data-readability` tags that don't apply anymore to the new structure
Merged PR #38 (Missing DOMEntityReference)

v1.0.0 > v1 🎉🎉🎉

andreskrey·8y ago·December 3, 2017

Hi all! Finally v1 is here. 🎉🎉🎉 The project changed drastically from v0, mainly because the HTMLParser is gone and the Readability class replaces it. I know, confusing, but this change aligns us with Readability.js and makes everything easier to port. Another huge change that I had wanted to make since version 0.0.1 was getting rid of the node encapsulation. v0 used the league\html-to-markdown NodeElement class to encapsulate the nodes and act as a middleman between your code and the DOMDocument. This caused lots of trouble because when you encapsulate nodes, you are actually severing the relation between the original DOM and the encapsulated node, forcing you to keep track of the changes between them instead of letting the system do it. Instead of encapsulating nodes, this version extends the original class, solving all these issues. Check the readme file to understand how to port your v0 code to v1 and the changelog to read about all the other changes. Enjoy!

v0.3.1

andreskrey·8y ago·December 1, 2017

📋 Changes

Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
Fix issue when converting divs to p nodes and never rating them (issue #29)
Fix "Unsupported operand types" (PR #31)
Fix division by zero when no title was found (issue #32)
New function to retrieve all images at once (PR #30)
Get the title from the `<title>` tag before searching on the `<meta>` tags

v0.3.0

andreskrey·8y ago·November 12, 2017

📋 Changes

Merged PR #24. Fixes notice when trying to extract `og:image`
Up to date with commit [c3ff1a2](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5) (2017-10-16), which includes the following changes:
New tags added to the unlikelyCandidates regex
Detection and removal of hierarchical separators in titles
Added more tags to clean after parsing the article (`button`, `textarea`, `select`, etc.)
New way to detect empty nodes (including an edge case where a node with a `&nbsp;` was detected as a node with content)
Better approach to find a top candidate (especially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
Detect text direction (`ltr` or `rtl`)
+ 5 more

v0.2.2

andreskrey·8y ago·September 14, 2017

📋 Changes

Added a safecheck for really nasty HTML
Added summonCthulhu option, to remove all script tags via regex
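
For the curious, removing script tags via regex can look something like the sketch below. This is not the library's actual implementation: `stripScriptTags` is a hypothetical helper and the pattern is just one reasonable choice.

```php
<?php

/**
 * Hypothetical helper: strip every <script>...</script> block from raw HTML
 * before handing the string to a DOM parser.
 */
function stripScriptTags(string $html): string
{
    // i = case-insensitive, s = let "." match newlines inside the script body.
    $clean = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

    // preg_replace returns null on failure; fall back to the untouched input.
    return $clean ?? $html;
}

$html = '<p>Hello</p><SCRIPT type="text/javascript">alert("hi");' . "\n" . '</SCRIPT><p>World</p>';
echo stripScriptTags($html), PHP_EOL; // prints <p>Hello</p><p>World</p>
```

Parsing HTML with regex is famously fragile (hence the option name), so a pass like this only makes sense as a last resort for input the DOM parser chokes on.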

v0.2.1

andreskrey·9y ago·May 31, 2017

📋 Changes

Added `normalizeEntities` flag to convert UTF-8 characters to their HTML entity equivalents. Fixes bugs on HTML documents with mixed encodings.
Added more information to the readme.md file
New way to create a backup DOM: not creating a backup. In the previous version, the system cloned the $this->dom object to keep it as a backup in order to restart the algorithm with other flags, if needed. This seemed to work until I realized that *sometimes* the backup changes even if we are not touching it. It seems that the `dom` and `backupdom` objects are linked and *some* changes on the dom object reach the backupdom object. The new approach consists of deleting the backupdom object and recreating the dom object from scratch. Of course this has a performance impact, but it seems to be quite low.
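
The "no backup" idea can be sketched roughly as follows, assuming the original HTML string is kept around. `buildDom` is a hypothetical helper, not the library's own code; it only uses the standard DOM and libxml functions.

```php
<?php

/**
 * Hypothetical helper: build a fresh DOMDocument from the original HTML string.
 */
function buildDom(string $html): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // silence warnings on messy HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

$source = '<html><body><div id="a">Keep me</div><div id="b">Drop me</div></body></html>';

// First pass: mutate the working DOM freely.
$dom = buildDom($source);
$second = $dom->getElementsByTagName('div')->item(1);
$second->parentNode->removeChild($second);
$afterFirstPass = $dom->getElementsByTagName('div')->length;

// Restart: discard the mutated DOM and re-parse the untouched source string.
$dom = buildDom($source);
$afterRestart = $dom->getElementsByTagName('div')->length;

echo $afterFirstPass, ' ', $afterRestart, PHP_EOL; // prints 1 2
```

Because the second document is parsed from the string again, nothing done to the first one can leak into it. The cost is one extra parse per restart.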

v0.2.0

andreskrey·9y ago·March 10, 2017

📋 Changes

Every unit test passes
Readability.php produces the same exact output as Readability.js
I'm happy :)

🐛 Fixed

Lots of bugs
Merged PR by DavidFricker to avoid exceptions while grabbing the document content

✨ Added

substituteEntities flag, to avoid replacing special characters with HTML entities. There's nothing we can do about `&nbsp;`; that entity is replaced by libxml and there's no way to disable it.
Named data sets so it's easier to detect which test case is failing.

🗑️ Removed

Couple of test cases that involved broken JS. There's nothing we can do about JS spilling onto the text.

v0.1.2

andreskrey·9y ago·December 26, 2016

📋 Changes

New way to get the metadata of the article.

v0.1.1

andreskrey·9y ago·December 26, 2016

📋 Changes

Small fix to clean style tags after creating the final article

v0.1.0 > First non-alpha version

andreskrey·9y ago·December 24, 2016

Happy Holidays! I've finally managed to port 100% of the code and make most of the test cases pass! There's a lot of work to do, but the current release behaves mostly like the original JS project. Enjoy!

v0.0.3-alpha > Lots of progress! (Pre-release)

andreskrey·9y ago·November 26, 2016

📋 Changes

Added prepArticle to remove junk after selecting the top candidates.
Added a function to restore the score after selecting top candidates. This basically works by scanning the data-readability tag and restoring the score to the contentScore variable. This is a horrible hack and should be removed once we ditch the Element interface of html-to-markdown and start extending the DOMDocument object.
Switched all strlen functions to mb_strlen
Fixed lots of bugs and I'm pretty sure I introduced a bunch of new ones.
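
Why the switch from strlen to mb_strlen matters: strlen counts bytes, so any length-based check is skewed on multibyte text. A minimal illustration (not code from the project, and it assumes the mbstring extension is loaded):

```php
<?php

// "café" with the é written as a Unicode escape, so the example does not
// depend on the encoding of this file. In UTF-8 the é takes two bytes.
$word = "caf\u{e9}";

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters

echo $bytes, ' ', $chars, PHP_EOL;  // prints 5 4
```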

v0.0.2-alpha > Last release where I'm using master as the main development branch (Pre-release)

andreskrey·9y ago·November 13, 2016

All the current development will be done in the develop branch.

v0.0.1-alpha > First version (Pre-release)

andreskrey·9y ago·November 7, 2016

Pre-release of the first version. Lots to do, lots to fix. But it's a nice start!
