andreskrey/readability.php

PHP port of Mozilla's Readability.js

18 Releases
Latest: 6y ago
v2.1.0 > The one where I realized that libxml didn't die on version v2.9.4 (Latest)
andreskrey · July 22, 2019

📋 Changes

  • Avoid overwriting extracted metadata with similarly named keys (like `og:image` and `og:image:width`)
  • Imported new `getSiteName()` feature from JS version as of [21 Dec 2018](https://github.com/mozilla/readability/pull/504)
  • Added getFirstElementChild function to NodeTrait + test case (Issue #83)
  • Reworked the test suite to use TestPage objects and give more hints about what failed
  • Removed getWordThreshold and setWordThreshold configuration functions
  • Added NodeUtility::filterTextNodes and deprecated NodeTrait getChildren()
  • Added new DOMNodeList fake class that mimics the original DOMNodeList class but allows adding new nodes to the list
  • Added new Dockerfiles that pull different versions of PHP and libxml. We now support 4 versions of PHP and 6 versions of libxml!
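The first bullet above is about keys that share a prefix, such as `og:image` and `og:image:width`. The following is only a minimal sketch of the idea using PHP's built-in DOM extension, not the library's actual code: `collectMetadata` is a hypothetical helper that stores each value under its full, exact property name and keeps the first value seen.

```php
<?php
// Hypothetical helper, not part of readability.php: collects <meta> values
// keyed by their full property name, so og:image:width can never replace og:image.
function collectMetadata(string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings from imperfect real-world markup.
    @$dom->loadHTML($html);

    $values = [];
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        $key = strtolower(trim($meta->getAttribute('property')));
        $content = trim($meta->getAttribute('content'));
        if ($key === '' || $content === '') {
            continue;
        }
        // Keep the first value seen for each exact key.
        if (!array_key_exists($key, $values)) {
            $values[$key] = $content;
        }
    }
    return $values;
}

$html = '<html><head>'
    . '<meta property="og:image" content="https://example.com/cover.jpg">'
    . '<meta property="og:image:width" content="1200">'
    . '</head><body></body></html>';

$meta = collectMetadata($html);
echo $meta['og:image'], "\n";
```

Matching on the exact key, rather than on a prefix or a loose pattern, is the design choice that keeps the two entries apart.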
v2.0.1 > Oopsie
andreskrey · November 27, 2018

Oopsie. Noticed that the main image was always missing from the results? That's because I screwed it up. But fear not, it's fixed. I also updated the tests to be a little more strict so this, IN THEORY, should not happen again.

v2.0.0 > Up to date with Readability.js again + docker containers
andreskrey · November 25, 2018

📋 Changes

  • Move phrasing contents into paragraphs
  • Improved the title detection
  • Remove single cell tables
  • Improved the detection of video related elements
  • New test cases
  • Various minor fixes
  • Clean <aside> tags during prepArticle().
  • Merged PR #58: Fix notice non-object on $parentOfTopCandidate for tumblr.com
  • + 5 more

📦 IT'S GONE

  • So make sure you run this code in a somewhat modern version of PHP. A version that starts with 7.
  • That's it. Take care. Call your mother.
v1.2.0 > Up to date with Readability.js
andreskrey · March 19, 2018

📋 Changes

  • Merged PR #49 (Missing object when calling `->getContent()`)
  • Imported all changes from Readability.js as of 2 March 2018 ([8525c6a](https://github.com/mozilla/readability/commit/8525c6af36d3badbe27c4672a6f2dd99ddb4097f)):
      • Check for `<base>` elements before converting URLs to absolute.
      • Clean `<link>` tags on `prepArticle()`
      • Attempt to return at least some text if all the algorithm runs fail (Check PR [#423](https://github.com/mozilla/readability/pull/423) on JS version)
      • Add new test cases for the previous changes
      • And all other changes reflected [in this diff](https://github.com/mozilla/readability/compare/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5...8525c6af36d3badbe27c4672a6f2dd99ddb4097f)
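The `<base>` check mentioned above can be pictured with a short sketch. It assumes the `<base href>` value is already an absolute URL, and `resolveBaseUrl` is a hypothetical name used for illustration, not a function from the library.

```php
<?php
// Hypothetical helper: returns the URL that relative links should be resolved
// against, preferring a <base href> in the document over the page URL.
function resolveBaseUrl(DOMDocument $dom, string $pageUrl): string
{
    foreach ($dom->getElementsByTagName('base') as $base) {
        $href = trim($base->getAttribute('href'));
        if ($href !== '') {
            return $href;
        }
    }
    // No usable <base> element: fall back to the URL the page was fetched from.
    return $pageUrl;
}

$dom = new DOMDocument();
@$dom->loadHTML(
    '<html><head><base href="https://cdn.example.com/articles/"></head>'
    . '<body><img src="cover.jpg"></body></html>'
);

echo resolveBaseUrl($dom, 'https://example.com/post/1'), "\n";
```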
v1.1.1 > The one with small changes
andreskrey · March 12, 2018

📋 Changes

  • Switched from assertEquals to assertSame on unit testing to avoid weak comparisons.
  • Added a safety check to avoid sending the DOMDocument as a node when scanning for node ancestors.
  • Fix issue #45: Small mistake in documentation
  • Fix issue #46: Added `data-src` as an image source path
  • Fixed bug when extracting all the images of the article (was extracting images from the original DOM instead of the parsed one)
  • Added the `->getDOMDocument()` getter to retrieve the fully parsed DOMDocument
  • Merged PR #48 that allows passing an array as configuration (@topotru)
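The "weak comparisons" in the first bullet correspond to the difference between PHP's loose `==` and strict `===` operators. The sketch below uses plain operators rather than PHPUnit itself, so it only illustrates the distinction; the value pairs are made up for the example.

```php
<?php
// Loose (==) and strict (===) comparison of the same pairs of values.
$pairs = [
    'int vs numeric string' => [1, '1'],
    'null vs empty string'  => [null, ''],
    'identical strings'     => ['title', 'title'],
];

$results = [];
foreach ($pairs as $label => [$expected, $actual]) {
    $results[$label] = [
        'loose'  => $expected == $actual,   // type juggling allowed
        'strict' => $expected === $actual,  // same type and same value required
    ];
}

foreach ($results as $label => $r) {
    printf(
        "%-22s loose=%s strict=%s\n",
        $label,
        var_export($r['loose'], true),
        var_export($r['strict'], true)
    );
}
```

A test that only checks loosely can pass when a function returns `null` where a string was expected, which is the kind of slip a strict assertion catches.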
v1.1.0 > Say hello to optional logging
andreskrey · January 11, 2018

📋 Changes

  • Added 'data-orig' as a URL source for images
  • Removed 'modal' as a negative property from classes
  • Added option to inject a logger
  • Removed all references to the `data-readability` tags that don't apply anymore to the new structure
  • Merged PR #38 (Missing DOMEntityReference)
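Lazy-loaded images often keep their real URL in a data attribute, which is why attributes such as `data-orig` matter as image sources. A minimal sketch of a fallback lookup follows; `imageSource` is a hypothetical helper, and the attribute order is an assumption rather than the library's documented behaviour.

```php
<?php
// Hypothetical helper: returns the first non-empty image URL found in the
// listed attributes, or null when none of them is set.
function imageSource(DOMElement $img, array $attributes = ['src', 'data-src', 'data-orig']): ?string
{
    foreach ($attributes as $name) {
        // getAttribute() returns an empty string when the attribute is missing.
        $value = trim($img->getAttribute($name));
        if ($value !== '') {
            return $value;
        }
    }
    return null;
}

$dom = new DOMDocument();
@$dom->loadHTML(
    '<html><body><img class="lazy" data-orig="https://example.com/photo.jpg"></body></html>'
);
$img = $dom->getElementsByTagName('img')->item(0);

echo imageSource($img), "\n";
```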
v1.0.0 > v1 🎉🎉🎉
andreskrey · December 3, 2017

Hi all! Finally v1 is here. 🎉🎉🎉 The project changed drastically from v0, mainly because the HTMLParser is gone and the Readability class replaces it. I know, confusing, but this change aligns us with Readability.js and makes everything easier to port.

Another huge change that I had wanted to make since version 0.0.1 was getting rid of the node encapsulation. v0 used the league\html-to-markdown NodeElement class to encapsulate the nodes and act as a middleman between your code and the DOMDocument. This caused lots of trouble because when you encapsulate nodes, you are actually severing the relation between the original DOM and the encapsulated node, forcing you to keep track of the changes between them instead of letting the system do it. Instead of encapsulating nodes, this version extends the original class, solving all these issues.

Check the readme file to understand how to port your v0 code to v1 and the changelog to read about all the other changes. Enjoy!
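To make the "extend instead of encapsulate" idea concrete, here is a small sketch built on PHP's `DOMDocument::registerNodeClass()`. The `ScoredElement` class and its `contentScore` property are hypothetical names chosen for the example, and whether the library wires things up exactly this way is an assumption.

```php
<?php
// Hypothetical subclass: a DOMElement that can carry a score directly.
class ScoredElement extends DOMElement
{
    public $contentScore = 0.0;
}

$dom = new DOMDocument();
// Ask the DOM extension to instantiate ScoredElement wherever it would
// otherwise create a plain DOMElement. This must happen before loading.
$dom->registerNodeClass('DOMElement', ScoredElement::class);
@$dom->loadHTML('<html><body><p>Some article text.</p></body></html>');

$paragraph = $dom->getElementsByTagName('p')->item(0);
$paragraph->contentScore = 5.0;

// The node is still part of the original tree, so there is no wrapper
// object whose state has to be kept in sync with the DOM.
echo get_class($paragraph), ' inside <', $paragraph->parentNode->nodeName, ">\n";
```

Because the extended object is the DOM node itself, moving or removing it goes through the normal DOM methods and the tree stays the single source of truth.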

v0.3.1
andreskrey · December 1, 2017

📋 Changes

  • Trim titles when detecting hierarchical separators to avoid false negatives on strings with spaces.
  • Fix issue when converting divs to p nodes and never rating them (issue #29)
  • Fix "Unsupported operand types" (PR #31)
  • Fix division by zero when no title was found (issue #32)
  • New function to retrieve all images at once (PR #30)
  • Get the title from the `<title>` tag before searching on the `<meta>` tags
v0.3.0
andreskrey · November 12, 2017

📋 Changes

  • Merged PR #24. Fixes notice when trying to extract `og:image`
  • Up to date with commit [c3ff1a2](https://github.com/mozilla/readability/commit/c3ff1a2d2c94c1db257b2c9aa88a4b8fbeb221c5) (2017-10-16), which includes the following changes:
      • New tags added to the unlikelyCandidates regex
      • Detection and removal of hierarchical separators in titles
      • Added more tags to clean after parsing the article (`button`, `textarea`, `select`, etc.)
      • New way to detect empty nodes (including an edge case where a node with a `&nbsp;` was detected as a node with content)
      • Better approach to find a top candidate (especially when a top candidate is the only child of a parent node, which allows a more accurate joining of sibling elements)
      • Detect text direction (`ltr` or `rtl`)
  • + 5 more
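The `&nbsp;` edge case in the empty-node bullet comes from the fact that a non-breaking space (U+00A0) is not removed by a plain `trim()`. The following sketch shows one way to treat it as whitespace; `hasNoText` is a hypothetical helper and does not reproduce the library's actual check.

```php
<?php
// Hypothetical helper: a node counts as empty when its text is only whitespace,
// with non-breaking spaces (U+00A0, what &nbsp; decodes to) treated as whitespace.
function hasNoText(DOMNode $node): bool
{
    $text = str_replace("\u{00A0}", ' ', $node->textContent);
    return trim($text) === '';
}

$dom = new DOMDocument();
@$dom->loadHTML('<html><body><p>&nbsp;</p><p>Hello</p></body></html>');
$paragraphs = $dom->getElementsByTagName('p');

var_dump(hasNoText($paragraphs->item(0)), hasNoText($paragraphs->item(1)));
```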
v0.2.2
andreskrey · September 14, 2017

📋 Changes

  • Added a safety check for really nasty HTML
  • Added summonCthulhu option, to remove all script tags via regex
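Removing script tags "via regex" means stripping them from the raw markup before any DOM parsing happens. The sketch below shows the general technique with `preg_replace`; `stripScripts` is a hypothetical function and the pattern is an assumption, not necessarily the expression the library uses.

```php
<?php
// Hypothetical helper: strips <script>...</script> blocks with a regular
// expression before the markup ever reaches the DOM parser.
function stripScripts(string $html): string
{
    // "i" ignores case, "s" lets "." span newlines, ".*?" stops at the first closing tag.
    $clean = preg_replace('/<script\b[^>]*>.*?<\/script>/is', '', $html);

    // preg_replace() returns null on failure; fall back to the input in that case.
    return $clean ?? $html;
}

$html = '<p>a</p><script type="text/javascript">var x = "</p>";</script><p>b</p>';
echo stripScripts($html), "\n";
```

Regular expressions cannot parse HTML in general, which is presumably what the option name jokes about, so this is best seen as a last resort for markup that breaks the parser.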
v0.2.1
andreskrey · May 31, 2017

📋 Changes

  • Added `normalizeEntities` flag to convert UTF-8 characters to their HTML entity equivalents. Fixes bugs on HTML documents with mixed encodings.
  • Added more information to the readme.md file
  • New way to create a backup DOM: not creating a backup. In the previous version, the system cloned the $this->dom object to keep it as a backup in order to restart the algorithm with other flags, if needed. This seemed to work until I realized that *sometimes* the backup changes even if we are not touching it. It seems that the `dom` and `backupdom` objects are linked and *some* changes on the dom object reach the backupdom object. The new approach consists of deleting the backupdom object and recreating the dom object from scratch. Of course this has a performance impact, but it seems to be quite low.
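The "no backup" approach from the last bullet can be sketched as follows: keep the original HTML string untouched and parse it again whenever the algorithm needs a clean start. `freshDom` is a hypothetical helper written for this illustration.

```php
<?php
// Hypothetical helper: builds a brand-new DOM from the untouched HTML string.
function freshDom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    return $dom;
}

$html = '<html><body><div id="sidebar">Ads</div><p>Article text.</p></body></html>';

// First attempt: an aggressive pass removes the sidebar.
$dom = freshDom($html);
$sidebar = $dom->getElementsByTagName('div')->item(0);
$sidebar->parentNode->removeChild($sidebar);
$afterFirstPass = $dom->getElementsByTagName('div')->length;

// Second attempt: rebuild from the string, so nothing from the first pass leaks in.
$dom = freshDom($html);
$afterRebuild = $dom->getElementsByTagName('div')->length;

echo $afterFirstPass, ' ', $afterRebuild, "\n";
```

Parsing twice costs more than cloning, but the string is immutable from the DOM's point of view, so the second tree cannot share state with the first.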
v0.2.0
andreskrey · March 10, 2017

📋 Changes

  • Every unit test passes
  • Readability.php produces exactly the same output as Readability.js
  • I'm happy :)

🐛 Fixed

  • Lots of bugs
  • Merged PR by DavidFricker to avoid exceptions while grabbing the document content

Added

  • substituteEntities flag, to avoid replacing special characters with HTML entities. There's nothing we can do about `&nbsp;`: that entity is replaced by libxml and there's no way to disable it.
  • Named data sets so it's easier to detect which test case is failing.

🗑️ Removed

  • A couple of test cases that involved broken JS. There's nothing we can do about JS spilling onto the text.
v0.1.2
andreskrey · December 26, 2016

📋 Changes

  • New way to get the metadata of the article.
v0.1.1
andreskrey · December 26, 2016

📋 Changes

  • Small fix to clean style tags after creating the final article
v0.1.0 > First non-alpha version
andreskrey · December 24, 2016

Happy Holidays! I've finally managed to port 100% of the code and make most of the test cases pass! There's a lot of work to do but the current release behaves mostly like the original JS project. Enjoy!

v0.0.3-alpha > Lots of progress! (Pre-release)
andreskrey · November 26, 2016

📋 Changes

  • Added prepArticle to remove junk after selecting the top candidates.
  • Added a function to restore the score after selecting top candidates. This basically works by scanning the data-readability tag and restoring the score to the contentScore variable. This is a horrible hack and should be removed once we ditch the Element interface of html-to-markdown and start extending the DOMDocument object.
  • Switched all strlen functions to mb_strlen
  • Fixed lots of bugs and pretty sure I introduced a bunch of new ones.
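The switch from `strlen` to `mb_strlen` matters for any text with non-ASCII characters, because `strlen` counts bytes while `mb_strlen` counts characters. A short illustration, assuming the mbstring extension is available and the text is UTF-8:

```php
<?php
// "Señor" written with an escape so the example does not depend on file encoding.
$text = "Se\u{00F1}or";

$bytes = strlen($text);               // counts bytes: ñ takes two bytes in UTF-8
$chars = mb_strlen($text, 'UTF-8');   // counts characters

echo $bytes, ' bytes, ', $chars, " characters\n";
```

Length-based heuristics, such as deciding whether a paragraph is long enough to count as content, would otherwise overestimate text in languages that rely on multibyte characters.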
v0.0.2-alpha > Last release where I'm using master as the main development branch (Pre-release)
andreskrey · November 13, 2016

All the current development will be done in the develop branch.

v0.0.1-alpha > First version (Pre-release)
andreskrey · November 7, 2016

Pre-release of the first version. Lots to do, lots to fix. But it's a nice start!