Scraping metadata (e.g. title, description, url, etc.) from a URL is something that Facebook currently does for you when you paste a URL into the “Update Status” box. For a service that I’m currently building out, we wanted to do this for our users as well. Thus Metascraper was born.
There was already a Ruby solution called link_thumbnailer, but since this is an I/O-heavy operation, I knew I wanted to build a solution using tools that support non-blocking I/O and can be used without getting caught in callback spaghetti. Scala, Akka, and the Play framework immediately came to mind.
Before I started building my own solution, I did some research and found that there were already some web-scraping solutions written in Scala or Java, such as chafed, and some more listed in this StackOverflow question.
I wanted something more focused, something that would “intelligently” return a page’s title, description, urls, and images. I also wanted to make sure that if the page implemented the Open Graph Protocol, the information from those tags got prioritised. Since these requirements were not being fulfilled by existing Scala libraries, I set about creating my own Scala library.
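To make the prioritisation idea concrete, here is a minimal sketch of "prefer Open Graph, fall back to plain HTML". It is not Metascraper's actual code: `RawTags`, `PageSummary`, and `OpenGraphSketch` are hypothetical names, and the sketch assumes a parser such as jsoup has already pulled the tag values out of the page.

```scala
// Hypothetical container for values a parser (e.g. jsoup) has already extracted.
final case class RawTags(
  openGraph: Map[String, String],   // e.g. "og:title" -> "Some title"
  htmlTitle: Option[String],        // contents of <title>
  metaDescription: Option[String]   // <meta name="description" content="...">
)

// Hypothetical result type for this sketch.
final case class PageSummary(title: Option[String], description: Option[String])

object OpenGraphSketch {
  // Treat null, empty, and whitespace-only values as missing.
  private def nonBlank(s: String): Option[String] =
    Option(s).map(_.trim).filter(_.nonEmpty)

  // Use the Open Graph value when present and non-blank, else the plain HTML one.
  private def prefer(og: Option[String], fallback: Option[String]): Option[String] =
    og.flatMap(nonBlank).orElse(fallback.flatMap(nonBlank))

  def summarise(tags: RawTags): PageSummary =
    PageSummary(
      title = prefer(tags.openGraph.get("og:title"), tags.htmlTitle),
      description = prefer(tags.openGraph.get("og:description"), tags.metaDescription)
    )
}
```

Working with `Option` keeps the fallback chain to a single `orElse`, and a blank `og:title` falls through to the regular `<title>` instead of producing an empty result.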
The main components of the Metascraper library include:
- Akka actors
- jsoup: While there were Scala web scrapers, the Java solution, jsoup, was very mature and easy to use.
Basic workflow (a.k.a. how to use)
This post won’t go into too much detail on how to use the library, because that is covered on the Metascraper GitHub page and will probably change over time, but this is the basic workflow:
- Instantiate a scraper actor
- Send a message to the scraper with the URL you want scraped
- When scraping is done, the actor will reply with a message containing the result
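The three steps above can be sketched with Akka's classic actor API. Everything Metascraper-specific here is an assumption made for illustration: `ScrapeRequest`, `ScrapeReply`, and `DemoScraper` are hypothetical stand-ins, and the demo actor replies with a canned title instead of fetching anything. The library's real actor and message types are documented on its GitHub page.

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical messages for this sketch; Metascraper's real types may differ.
final case class ScrapeRequest(url: String)
final case class ScrapeReply(url: String, title: Option[String])

// Stand-in actor: answers with a canned title rather than scraping the page.
class DemoScraper extends Actor {
  def receive: Receive = {
    case ScrapeRequest(url) =>
      sender() ! ScrapeReply(url, Some(s"Title of $url"))
  }
}

object WorkflowSketch {
  def scrapeOnce(url: String): ScrapeReply = {
    val system = ActorSystem("scraper-demo")
    implicit val timeout: Timeout = Timeout(5.seconds)
    try {
      // 1. Instantiate the scraper actor.
      val scraper = system.actorOf(Props(new DemoScraper), "scraper")
      // 2. Send it a message; `?` returns a Future of the reply.
      val reply = (scraper ? ScrapeRequest(url)).mapTo[ScrapeReply]
      // 3. Wait for the reply (blocking only to keep the demo short).
      Await.result(reply, timeout.duration)
    } finally {
      system.terminate()
    }
  }
}
```

In a Play controller you would normally map over the `Future` returned by `?` instead of calling `Await.result`, so the request thread is never blocked.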
The project is Mavenised and is available from the Central Repository, so simply add it as a libraryDependency in your build.sbt (the version may have changed by the time you read this, so refer to the project’s GitHub page for the current coordinates).
To use it, follow the usage example on the project’s GitHub page.
I’ve created an example Play2 application that integrates this library, called metascraper-service. Feel free to take a look!
Please give Metascraper a test drive and submit issues and pull requests!