Skraper
Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Facebook, Instagram, Twitter, Youtube, Tiktok, Telegram, Twitch, Reddit, 9GAG, Pinterest, Flickr, Tumblr, Coub, Vimeo, IFunny, VK, Odnoklassniki, Pikabu)
The project is written primarily in Kotlin, distributed under the Apache License 2.0, and first published in 2020. Key topics include: 9gag, coub, facebook, flickr, ifunny.
Overview
A Kotlin/Java library and cli tool which allows scraping and downloading posts, attachments and other meta information from more than 10
sources without any authorization or full page rendering. Based on jsoup, jackson and kotlin-coroutines.
Repository contains:
- Cli tool
- Kotlin library
- Telegram bot
Current list of implemented sources:
- Facebook
- Instagram
- Twitter
- Youtube
- TikTok
- Telegram
- Twitch
- Reddit
- 9GAG
- Pinterest
- Flickr
- Tumblr
- Vimeo
- IFunny
- Coub
- VK
- Odnoklassniki
- Pikabu
Bugs
Unfortunately, each website is subject to change without any notice, so the tool may stop working correctly for a given source.
If that happens, please let me know via an issue.
Cli tool
The cli tool allows you to:
- download media with the --media-only flag from almost all presented sources
- scrape posts meta information
Requirements:
- Java: 1.8+
- Maven (optional)
Build tool
```bash
./mvnw clean package -DskipTests=true
```
Usage:
```bash
./skraper --help
```
```text
usage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]
       [--parallel-downloads PARALLEL_DOWNLOADS]

optional arguments:
  -h, --help                    show this help message and exit
  -n LIMIT, --limit LIMIT       posts limit (50 by default)
  -t TYPE, --type TYPE          output type, options: [log, csv, json, xml, yaml]
  -o OUTPUT, --output OUTPUT    output path
  -m, --media-only              scrape media only
  --parallel-downloads PARALLEL_DOWNLOADS
                                amount of parallel downloads for media items if
                                enabled flag --media-only (4 by default)

positional arguments:
  PROVIDER                      skraper provider, options: facebook, instagram,
                                twitter, youtube, tiktok, telegram, twitch, reddit,
                                9gag, pinterest, flickr, tumblr, ifunny, vk, pikabu,
                                vimeo, odnoklassniki, coub
  PATH                          path to user/community/channel/topic/trend
```
Examples:
```bash
./skraper 9gag /hot
./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts
./skraper instagram /explore/tags/memes -t json
./skraper flickr /photos/harrythehawk -t yaml
./skraper pinterest /levato/meme -t xml
./skraper youtube /user/JetBrainsTV/videos --media-only -n 2
```
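If the cli tool is driven from another JVM program, the argument order from the help text can be captured in a small builder. The sketch below is only an illustration: `skraperCommand` and its parameter names are hypothetical and not part of skraper; it uses only the flags documented above and assumes the executable is reachable as `./skraper`.

```kotlin
// Hypothetical helper, not part of skraper: assembles an argument list
// from the flags documented in the help text (-n, -t, -o, --media-only).
fun skraperCommand(
    provider: String,
    path: String,
    limit: Int? = null,
    type: String? = null,
    output: String? = null,
    mediaOnly: Boolean = false,
    executable: String = "./skraper" // assumed location of the built script
): List<String> = buildList {
    add(executable)
    add(provider)
    add(path)
    if (limit != null) { add("-n"); add(limit.toString()) }
    if (type != null) { add("-t"); add(type) }
    if (output != null) { add("-o"); add(output) }
    if (mediaOnly) add("--media-only")
}

fun main() {
    val command = skraperCommand(
        provider = "reddit",
        path = "/r/memes",
        limit = 5,
        type = "csv",
        output = "./reddit/posts"
    )
    // Prints: ./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts
    println(command.joinToString(" "))
    // To actually run it, the built script must exist in the working directory:
    // ProcessBuilder(command).inheritIO().start().waitFor()
}
```

Passing the arguments as a list rather than one concatenated string avoids shell quoting issues for paths that contain spaces.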
Kotlin Library
Distribution
Maven:
```xml
<dependency>
    <groupId>ru.sokomishalov.skraper</groupId>
    <artifactId>skrapers</artifactId>
    <version>x.y.z</version>
</dependency>
```
Gradle kotlin dsl:
```kotlin
implementation("ru.sokomishalov.skraper:skrapers:x.y.z")
```
Usage
Instantiate specific scraper
As mentioned before, the provider implementation list is:
- FacebookSkraper
- InstagramSkraper
- TwitterSkraper
- YoutubeSkraper
- TikTokSkraper
- TelegramSkraper
- TwitchSkraper
- RedditSkraper
- NinegagSkraper
- PinterestSkraper
- FlickrSkraper
- TumblrSkraper
- VimeoSkraper
- IFunnySkraper
- CoubSkraper
- VkSkraper
- OdnoklassnikiSkraper
- PikabuSkraper
After that, usage is as simple as:
```kotlin
val skraper = InstagramSkraper(client = OkHttpSkraperClient())
```
Important: it is highly recommended not to use DefaultBlockingSkraperClient. There are more efficient, non-blocking and
resource-friendly implementations of SkraperClient. To use them, you just have to put the required dependencies on the classpath.
Current http-client implementation list:
- DefaultBlockingSkraperClient: simple java.net.* blocking api implementation
- OkHttpSkraperClient: okhttp3 implementation
- SpringReactiveSkraperClient: spring-webflux client implementation
- KtorSkraperClient: ktor-client-jvm implementation
Available methods
Each scraper is a class which implements the Skraper interface:
```kotlin
interface Skraper {
    val client: SkraperClient
    fun getPosts(path: String): Flow<Post>
    suspend fun getPageInfo(path: String): PageInfo?
    fun supports(media: Media): Boolean
    suspend fun resolve(media: Media): Media
}
```
There are also some provider-specific kotlin extensions for the implementations. You can find them in the provider
implementation package.
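Since every provider implements the same interface, provider-agnostic helpers can be written against Skraper itself. The following is a minimal sketch, not library code: `firstPosts` and `resolveIfSupported` are hypothetical names, and the import paths for Skraper, Media and Post are assumptions that should be checked against the library version in use. It needs the skrapers dependency and kotlinx-coroutines on the classpath.

```kotlin
import kotlinx.coroutines.flow.take
import kotlinx.coroutines.flow.toList
// Assumed package names; verify them against the version you depend on.
import ru.sokomishalov.skraper.Skraper
import ru.sokomishalov.skraper.model.Media
import ru.sokomishalov.skraper.model.Post

// Hypothetical helper: collects at most `limit` posts from any provider.
// getPosts() returns a Flow, so nothing is requested until it is collected.
suspend fun Skraper.firstPosts(path: String, limit: Int = 10): List<Post> =
    getPosts(path).take(limit).toList()

// Hypothetical helper: resolves the media only when this provider supports it,
// otherwise returns the media unchanged.
suspend fun Skraper.resolveIfSupported(media: Media): Media =
    if (supports(media)) resolve(media) else media
```

Limiting the flow with take() before collecting it keeps the number of collected posts bounded regardless of how many posts the source offers.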
Usage from plain Java
There is an out-of-box java interop utility class ru.sokomishalov.skraper.util.JavaInterop:
```java
class Example {
    public static void main(String[] args) {
        Skraper skraper = new InstagramSkraper();
        List<Post> posts = JavaInterop.limitedFlow(skraper.getPosts("/memes.video"), 10);
        PageInfo info = JavaInterop.callBlocking(cont -> skraper.getPageInfo("/memes.video", cont));
    }
}
```
Scrape user/community/channel/topic/trend posts
To scrape the latest posts for a specific user, channel or trend, use skraper like this:
```kotlin
suspend fun main() {
    val skraper = FacebookSkraper()
    val posts = skraper.getUserPosts(username = "memes").take(2).toList() // extension for getPosts()
    // or
    val postsDetected = Skrapers.getPosts(url = "https://facebook.com/memes") // aggregating singleton
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(posts))
}
```
The received data structure is similar across all providers. Output data example:
```json5
[
  {
    "id": "5029851093699104",
    "text": "gotta love em!",
    "publishedAt": 1580744400000,
    "statistics": {
      "likes": 79,
      "comments": 3
    },
    "media": [
      {
        "url": "https://facebook.com/memes/posts/5029851093699104?__xts__%5B0%5D=68.ARA2yRI2YnlXQRKX7Pdphh8ztgvnP11aYE_bZFPNmqLpJZLhwJaG24gDPUTiKDLv-J_E09u2vLjCXalpmEuGSmVR0BkVtcng_i6QV8x5e-aZUv0Mkn1wwKLlhp5NNH6zQWKlqDqRjZrwvcKeUi0unzzulRCHRvDIrbz2leM6PLescFySwMYbMmKFc7ctqaC_F7nJ09Ya0lz9Pqaq_Rh6UsNKom6fqdgHAuoHV894a3QRuyY0BC6fQuXZLOLbRIfEVK3cF9Z5UQiXUYruCySF-WpQEV0k72x6DIjT6B3iovYFnBGHaji9VAx2PByZ-MDs33D1Hz96Mk-O1Pj7zBwO6FvXGhkUJgepiwUOVd0q-pV83rS5EhjtPFDylNoNO2xkDUSIi483p49vumVPWtmab8LX1V6w2anf55kh6pedCXcH3D8rBjz8DaTBnv995u9kk5im-1-HdAGQHyKrCZpaA0QyC-I4oGsCoIJGck3RO8u_SoHcfe2tKjTgPe6j9p1D&__tn__=-R",
        "aspectRatio": 0.864,
        "duration": 10860.000000000
      }
    ]
  },
  {
    "id": "4990218157662398",
    "text": "Interesting",
    "publishedAt": 1580742000000,
    "statistics": {
      "likes": 3092,
      "comments": 514
    },
    "media": [
      {
        "url": "https://scontent.fhrk1-1.fna.fbcdn.net/v/t1.0-0/p526x296/52333452_10157743612509879_529328953723191296_n.png?_nc_cat=1&_nc_ohc=oNMb8_mCbD8AX-w9zeY&_nc_ht=scontent.fhrk1-1.fna&oh=ca8a719518ecfb1a24f871282b860124&oe=5E910D0C",
        "aspectRatio": 0.8960573476702509
      }
    ]
  }
]
```
You can see the full model structure for posts and others here
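To illustrate how the scraped fields can be consumed, here is a small self-contained sketch that works on the values from the sample output. `PostSummary` and `PostStats` are hypothetical stand-in types that mirror only the JSON fields shown above; the library's own model classes may differ. It also assumes that publishedAt is a Unix timestamp in milliseconds, which is what the sample values suggest.

```kotlin
import java.time.Instant

// Hypothetical stand-in types mirroring only the JSON fields shown above;
// these are not the library's model classes.
data class PostStats(val likes: Int, val comments: Int)

data class PostSummary(
    val id: String,
    val text: String,
    val publishedAt: Long,
    val statistics: PostStats
)

// Assumption: publishedAt is a Unix timestamp in milliseconds.
fun PostSummary.publishedInstant(): Instant = Instant.ofEpochMilli(publishedAt)

// Returns the post with the highest like count, or null for an empty list.
fun mostLiked(posts: List<PostSummary>): PostSummary? =
    posts.maxByOrNull { it.statistics.likes }

fun main() {
    val posts = listOf(
        PostSummary("5029851093699104", "gotta love em!", 1580744400000, PostStats(79, 3)),
        PostSummary("4990218157662398", "Interesting", 1580742000000, PostStats(3092, 514))
    )
    println(posts.first().publishedInstant()) // 2020-02-03T15:40:00Z
    println(mostLiked(posts)?.id)             // 4990218157662398
}
```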
Scrape user/community/channel/topic/trend info
It is also possible to scrape user/channel/trend info:
```kotlin
suspend fun main() {
    val skraper = TwitterSkraper()
    val pageInfo = skraper.getUserInfo(username = "memes") // extension for `getPageInfo()`
    // or
    val pageInfoDetected = Skrapers.getPageInfo(url = "https://twitter.com/memes") // aggregating singleton
    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(pageInfo))
}
```
Output:
```json5
{
  "nick": "memes",
  "name": "Memes.com",
  "description": "http://memes.com is your number one website for the funniest content on the web. You will find funny pictures, funny memes and much more.",
  "statistics": {
    "posts": 10848,
    "followers": 154718
  },
  "avatar": {
    "url": "https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg"
  },
  "cover": {
    "url": "https://abs.twimg.com/images/themes/theme1/bg.png"
  }
}
```
Resolve provider relative url
Sometimes you need to know the direct media link:
```kotlin
suspend fun main() {
    val skraper = InstagramSkraper()
    val info = skraper.resolve(Video(url = "https://www.instagram.com/p/B-flad2F5o7/"))
    val serializer = JsonMapper().writerWithDefaultPrettyPrinter()
    println(serializer.writeValueAsString(info))
}
```
Output:
```json5
{
  "url": "https://scontent-amt2-1.cdninstagram.com/v/t50.2886-16/91508191_213297693225472_2759719910220905597_n.mp4?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=104&_nc_ohc=27bC52qar_oAX-7J2Zh&oe=5EC0BC52&oh=0aafee2860c540452b76e7b8e336147d",
  "aspectRatio": 0.8010012515644556,
  "thumbnail": {
    "url": "https://scontent-amt2-1.cdninstagram.com/v/t51.2885-15/e35/91435498_533808773845524_5302421141680378393_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com&_nc_cat=100&_nc_ohc=8gPAcByc6YAAX_kDBWm&oh=5edf6b9d90d606f9c0e055b7dbcbfa45&oe=5EC0DDE8",
    "aspectRatio": 0.8010012515644556
  }
}
```
Download media
There is a "static" method which allows downloading any media from all known implemented sources:
```kotlin
suspend fun main() {
    val tmpDir = Files.createTempDirectory("skraper").toFile()

    val testVideo = Skrapers.download(
        media = Video("https://youtu.be/fjUO7xaUHJQ"),
        destDir = tmpDir,
        filename = "Gandalf"
    )

    val testImage = Skrapers.download(
        media = Image("https://www.pinterest.ru/pin/89509111320495523/"),
        destDir = tmpDir,
        filename = "Do_no_harm"
    )

    println(testVideo)
    println(testImage)
}
```
Output:
```log
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Gandalf.mp4
/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Do_no_harm.jpg
```
Telegram bot
To use the bot follow the link.
