Version 0.0.4 of Schwatcher has been released.

Changes:

  • No longer uses Akka Agent to hold CallbackRegistry (thanks crdueck). This should result in a small performance increase because of more ‘direct’ memory access inside MonitorActor.
  • Refactored testing for better coverage and maintainability
  • Scala 2.10.3 support in testing

Relevant info:

For the most part, I’m a very happy Heroku user. The platform allows me to deploy my apps, be they Java, Scala, or Ruby-based, without having to think or worry about infrastructure, which is amazing. It also lets me do this largely for free. I love it, and so do many others. That said, sometimes you do run into problems that cause you to lose lots of time debugging … in the wrong direction.

Yesterday, after adding non-blocking I/O to my Metascraper library, I load tested my deployed application and found it fatally crashing with java.lang.OutOfMemoryError: unable to create new native thread errors. Not good.

Sorry for the quick version-up. Version 0.1.1 added non-blocking I/O, but used Dispatch without configuring the thread pool used for HTTP connections. This caused issues on Heroku, where there is a combined limit of 256 threads and processes for 1x dynos (512 for 2x dynos): once that limit was hit, Java OOM “unable to create new native thread” errors would be thrown.

0.2.1 adds:

  • Configuration of Actor HTTP client on ScraperActor instantiation
    • Notably: HTTP client ExecutorService thread pool
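To illustrate why a configured pool helps, here is a minimal sketch of a bounded thread pool built with the standard java.util.concurrent API. It does not show Dispatch's or Metascraper's actual configuration; the object name BoundedPoolSketch and the cap of 10 threads are assumptions made purely for illustration. The point is that, however many tasks are submitted, the number of native threads stays at or below the cap.

```scala
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BoundedPoolSketch {
  // Hypothetical cap, chosen to stay far below a platform's thread limit
  val maxThreads = 10

  // Runs `taskCount` small tasks and returns the names of the threads that ran them
  def run(taskCount: Int): Set[String] = {
    // A fixed pool never creates more than `maxThreads` worker threads
    val pool: ExecutorService = Executors.newFixedThreadPool(maxThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = (1 to taskCount).map { _ =>
        Future { Thread.currentThread().getName }
      }
      Await.result(Future.sequence(futures), 10.seconds).toSet
    } finally {
      pool.shutdown() // release the worker threads once the work is done
    }
  }
}
```

Calling BoundedPoolSketch.run(100) submits 100 tasks, yet the returned set holds at most 10 distinct thread names. An ExecutorService like this is the kind of object an HTTP client can be given instead of creating threads without limit.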

Relevant info:

Metascraper v0.1.1 has been released. Major changes include:

  • Async / non-blocking I/O for page requests: Originally suggested by analytically, I’ve added asynchronous requesting of webpages via Dispatch
  • ScraperActor now replies with Either[Throwable, ScrapedData] whereas before it replied with Either[Throwable, ScrapedData]. This allows library users to access the full capabilities of thrown objects. This might break your app.
  • Added URL validations
  • Better guessing of metadata
  • More relevant User-Agent out of the box
  • Better test coverage
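For library users, handling a reply of type Either[Throwable, ScrapedData] might look roughly like the sketch below. The ScrapedData case class here is a simplified, hypothetical stand-in (the real one is not reproduced), and summarize is an illustrative helper, not part of Metascraper. It shows what having the full Throwable buys you: callers can match on the exception type rather than only reading a message.

```scala
object EitherSketch {
  // Hypothetical, simplified stand-in for the library's ScrapedData
  final case class ScrapedData(url: String, title: String)

  // Illustrative helper: turn a reply into a human-readable summary
  def summarize(reply: Either[Throwable, ScrapedData]): String = reply match {
    case Right(data) =>
      s"scraped '${data.title}' from ${data.url}"
    case Left(e: java.io.IOException) =>
      // Having the Throwable lets callers branch on the kind of failure
      s"I/O problem: ${e.getMessage}"
    case Left(e) =>
      s"failed with ${e.getClass.getSimpleName}: ${e.getMessage}"
  }
}
```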

Relevant info:

  • Metascraper Github repo
  • Add libraryDependencies += "com.beachape.metascraper" %% "metascraper" % "0.1.1" to build.sbt to install

Scraping metadata (e.g. title, description, url, etc.) from a URL is something that Facebook currently does for you when you paste a URL into the “Update Status” box. For a service that I’m currently building out, we wanted to do this as well for our users. Thus Metascraper was born.

There was already a Ruby solution called link_thumbnailer, but since this is an I/O-heavy operation, I knew I wanted to build a solution using tools that supported non-blocking I/O and could be used without getting caught in callback spaghetti. Scala, Akka, and the Play framework immediately came to mind.

The WatchService was added as part of Java 7 and introduced the ability to monitor files through the JVM without the use of external libraries like JNotify that require installing native libraries. Using this API for a project that requires monitoring files makes handling dependencies for both deployment and development much simpler.

Since Scala is able to directly invoke Java, I wanted to use this API when I was building Akanori-thrift, a trending-words detection service that is focused on the Japanese language. This post will not go over that service in detail (that will take up an entire post of its own, if not more), but my use-case there was to monitor a custom dictionary file for updates and then spawn a new instance of the Tokenizer that uses the updated state of the file.

I quickly realised a few pain-points:

  1. There existed no file-monitoring Scala library (at the time).
  2. Using the WatchService API requires the use of a blocking thread to get events.
  3. The WatchService API does not have recursive monitoring support built in.

To address these, I set out to create Schwatcher, a Scala library that wraps the WatchService API of Java 7 and allows callbacks to be registered and unregistered on both directories and files, either as individual paths or recursively. Furthermore, I wanted to facilitate the use of the Java 7 API in Scala in a simple way that is in line with the functional programming paradigm.
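To make the blocking pain-point concrete, here is a minimal sketch of using the raw WatchService API from Scala. This is not Schwatcher's code; the object name WatchSketch and the method awaitChange are hypothetical. The call to take() parks the calling thread until an event arrives, and only the registered directory itself is watched, not its subdirectories.

```scala
import java.nio.file.{Path, WatchKey}
import java.nio.file.StandardWatchEventKinds.{ENTRY_CREATE, ENTRY_MODIFY}

object WatchSketch {
  // Blocks the calling thread until something is created or modified
  // directly inside `dir`, then returns the names of the affected entries.
  def awaitChange(dir: Path): List[String] = {
    val watcher = dir.getFileSystem.newWatchService()
    try {
      // Only `dir` itself is registered; subdirectories are not watched
      dir.register(watcher, ENTRY_CREATE, ENTRY_MODIFY)
      val key: WatchKey = watcher.take() // blocks until an event is available
      var names = List.empty[String]
      val events = key.pollEvents().iterator()
      while (events.hasNext) {
        val event = events.next()
        // context() is the relative path of the entry; it can be null for overflow events
        Option(event.context()).foreach(c => names = c.toString :: names)
      }
      key.reset() // a real watcher would reset the key and loop to keep receiving events
      names.reverse
    } finally {
      watcher.close()
    }
  }
}
```

In a real application this loop would have to live on a dedicated thread (or inside an actor), which is exactly the kind of plumbing a wrapper library can hide behind callbacks.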

In Scala, there are a lot of cool things - too many to list. Among them is something borrowed from Haskell: Maybe, spelt Option in Scala, which itself is based on the concept of computing via monads.

The reason why Option is awesome is that, if used properly, it largely frees the programmer from having to worry about various variables being in states of nothingness (Nil, null, etc.). Without fail, every programmer has at one point or another written things like thing.nil? ? do_nothing : do_something … all over the place. The point (in my mind, at least) of Option is to free us from having to do this in as many places as possible.

Many libraries in Scala, such as Scala-Redis, are made with the assumption that the programmer knows how to deal with Option, and return results wrapped in either Some[List[T]] or None. That said, how to work with these types of results is not exactly straightforward for someone coming from other languages that don’t have such constructs, so I’ve written down some of my thoughts.
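As a small sketch of what working with such results can look like, the example below uses a hypothetical fetchList function backed by an in-memory Map as a stand-in for a client call that returns Some(list) or None. It is not Scala-Redis's API; the names OptionSketch, store, and fetchList are assumptions for illustration only. It shows three common ways to consume an Option without writing a single null check.

```scala
object OptionSketch {
  // Hypothetical stand-in for a client call: Some(list) on a hit, None on a miss
  private val store = Map("langs" -> List("scala", "ruby"))
  def fetchList(key: String): Option[List[String]] = store.get(key)

  // 1. Transform the value if present, and fall back to a default otherwise
  def countOf(key: String): Int =
    fetchList(key).map(_.length).getOrElse(0)

  // 2. Pattern match when both cases need distinct handling
  def describe(key: String): String = fetchList(key) match {
    case Some(items) => items.mkString(", ")
    case None        => "nothing found"
  }

  // 3. Chain several optional steps with a for-comprehension;
  //    the result is None as soon as any step yields None
  def firstUpper(key: String): Option[String] = for {
    items <- fetchList(key)
    head  <- items.headOption
  } yield head.toUpperCase
}
```

For example, OptionSketch.countOf("missing") evaluates to 0 and OptionSketch.firstUpper("langs") evaluates to Some("SCALA").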

Ruby 2.0.0 was released a few months back and I finally had some time to look into some of the features and changes that came with it. Lazy collections have always been a cool concept for me, so I decided to do a few benchmarks.

Oftentimes, as programmers, we need to check to see if a substring exists in a bigger string. Many programmers will instinctively reach for Regex matching, but I often wondered if this was really the best way to do things, particularly in Ruby.

One day, in a Ruby-focused Skype group chat, a friend of mine asked the other members to give him a snippet of code that would allow him to take a hostname, check if the substring ‘qa’ was in it and, if it was, return ‘qa’, else ‘prod’ (for production). My knee-jerk reaction was the same as the other members: use Regex. But I wondered if it would be better to use compiled Regex, or interpreted Regex, or perhaps even the built-in String method include?. So I decided to do some benchmarking.