Introducing Metascraper - a Scala Library for Scraping Page Metadata

Scraping metadata (title, description, URL, and so on) from a URL is something that Facebook does for you when you paste a link into the “Update Status” box. We wanted to offer the same thing to users of a service that I’m currently building out. Thus Metascraper was born.

There was already a Ruby solution called link_thumbnailer, but since this is an I/O-heavy operation, I wanted to build on tools that support non-blocking I/O and can be used without getting caught in callback spaghetti. Scala, Akka, and the Play framework immediately came to mind.

Existing solutions

Before building my own solution, I did some research and found several web-scraping libraries already written in Scala or Java, such as chafed, along with others listed in this StackOverflow question.

I wanted something more focused, something that would “intelligently” return a page’s title, description, URLs, and images. I also wanted to make sure that if the page implemented the Open Graph Protocol, the information from those tags was prioritised. Since the existing Scala libraries did not meet these requirements, I set about creating my own.
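
To make the “Open Graph first” idea concrete, here is a minimal sketch of the fallback logic using only the Scala standard library. It is not Metascraper’s actual implementation: `PageTags`, `Summary`, and `OpenGraphFirst` are hypothetical names invented for illustration, and the sketch assumes the tags have already been extracted from the page.

```scala
// Hypothetical stand-in for the tags pulled out of a page's <head>;
// Metascraper's real data model may look different.
final case class PageTags(
  title: Option[String],            // contents of <title>
  metaDescription: Option[String],  // <meta name="description" content="...">
  openGraph: Map[String, String]    // e.g. "og:title" -> "..."
)

// Hypothetical result type holding the values we decided to use.
final case class Summary(title: String, description: String)

object OpenGraphFirst {
  // Look up an Open Graph property, treating blank values as missing.
  private def og(tags: PageTags, property: String): Option[String] =
    tags.openGraph.get(property).map(_.trim).filter(_.nonEmpty)

  // Prefer the Open Graph value, then the plain HTML tag, then an empty string.
  def summarise(tags: PageTags): Summary =
    Summary(
      title = og(tags, "og:title").orElse(tags.title).getOrElse(""),
      description = og(tags, "og:description").orElse(tags.metaDescription).getOrElse("")
    )
}
```

With a page that defines only `og:title`, `OpenGraphFirst.summarise` would take the title from the Open Graph tag and fall back to the plain meta description for the rest.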

Metascraper Components

The main components of the Metascraper library include:

Akka actors
jsoup: While there were Scala web scrapers, the Java solution, jsoup, was very mature and easy to use.

Basic workflow (a.k.a. how to use)

This post won’t go into much detail on how to use the library, because that is covered on the Metascraper Github page and will probably change over time, but this is the basic workflow:

Instantiate a ScraperActor
Send a message to the scraper with ScrapeUrl(url: String)
When scraping is done, the actor will reply with an Either[FailedToScrapeUrl,ScrapedData]

The project is Mavenised and is available from the Central Repository, so simply add it to libraryDependencies in your build.sbt (the version may have changed by the time you read this, so refer to the project’s Github page):

libraryDependencies += "com.beachape.metascraper" %% "metascraper" % "0.0.2"

And to use it,

Metascraper example code (metascraper_example.scala)

import akka.actor.ActorSystem
import akka.pattern.ask
import akka.util.Timeout
import com.beachape.metascraper.Messages._
import com.beachape.metascraper.ScraperActor
import scala.concurrent.duration._

// Upper bound on how long ask() waits for the actor's reply
implicit val timeout = Timeout(30.seconds)

implicit val system = ActorSystem("actorSystem")
// Execution context used to run the Future callback below
implicit val dispatcher = system.dispatcher

val scraperActor = system.actorOf(ScraperActor())

// ask() returns a Future, so nothing blocks while the page is being scraped
for {
  reply <- ask(scraperActor, ScrapeUrl("https://bbc.co.uk")).mapTo[Either[FailedToScrapeUrl,ScrapedData]]
} {
  reply match {
    case Left(failed) => {
      println("Failed: ")
      println(failed.message)
    }
    case Right(data) => {
      println("Image URLs:")
      data.imageUrls.foreach(println)
    }
  }
}

/*
 #=>
  Image URLs:
  http://www.bbc.co.uk/img/iphone.png
  http://sa.bbc.co.uk/bbc/bbc/s?name=SET-COUNTER&pal_route=index&ml_name=barlesque&app_type=web&language=en-GB&ml_version=0.16.1&pal_webapp=wwhp&blq_s=3.5&blq_r=3.5&blq_v=default-worldwide
  http://static.bbci.co.uk/frameworks/barlesque/2.51.2/desktop/3.5/img/blq-blocks_grey_alpha.png
  http://static.bbci.co.uk/frameworks/barlesque/2.51.2/desktop/3.5/img/blq-search_grey_alpha.png
  http://news.bbcimg.co.uk/media/images/69612000/jpg/_69612953_69612952.jpg
*/

Example application

I’ve created an example Play2 application that integrates this library, called metascraper-service. Feel free to take a look!

Conclusion

Please give Metascraper a test drive and submit issues and pull requests!
