Docker, Linux & Azure – Entity extraction in a few minutes

Yes, having Azure, Docker and Linux in the same sentence is not something you see every day. But with the latest addition from Microsoft, not only is it possible, you can also have any Docker container you want running as a fully hosted web app on Azure in only a few minutes. Want to find out what I’m talking about? Continue reading …

CLIFF+CLAVIN – A lightweight entity extraction and geotagging library

While searching the internet for easy-to-use libraries, I came across an interesting option: CLAVIN, which stands for Cartographic Location And Vicinity INdexer. The project processes unstructured text documents, extracts location names and resolves them against gazetteer records. It creates a Lucene index for quick lookups and even employs fuzzy search to handle misspelled location names.
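To give a feel for what fuzzy matching against a gazetteer means, here is a toy sketch in Python. It is not how CLAVIN works internally (CLAVIN uses a Lucene index); it swaps in the standard library's difflib for the matching, and the three-entry gazetteer simply reuses the names and GeoNames IDs from the sample response further down in this post. The function name fuzzy_lookup is made up for the illustration.

```python
import difflib

# Tiny stand-in gazetteer: place name -> GeoNames ID.
GAZETTEER = {
    "New Delhi": 1261481,
    "National Capital Territory of Delhi": 1273293,
    "Republic of India": 1269750,
}


def fuzzy_lookup(name, cutoff=0.6):
    """Return (canonical_name, geoname_id) for the closest entry, or None."""
    # Compare in lowercase so that casing does not affect the similarity score.
    by_lower = {key.lower(): key for key in GAZETTEER}
    matches = difflib.get_close_matches(
        name.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    canonical = by_lower[matches[0]]
    return canonical, GAZETTEER[canonical]


if __name__ == "__main__":
    # A misspelled name still resolves to the intended gazetteer entry.
    print(fuzzy_lookup("New Dehli"))
```

A real gazetteer has millions of entries, which is exactly why CLAVIN builds an index instead of scanning a list like this.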

CLIFF is an improved version of CLAVIN that uses the Stanford NER module and is wrapped as a web server exposing a REST API over HTTP. You can find more information about it on the MediaMeter site and learn why it is called CLIFF.

Azure Deployment

Both CLIFF and CLAVIN are written in Java, and although running Java applications on Azure is not that complicated, the deployment script needed to actually make CLIFF work is a bit too much.

Part of the installation requires downloading the latest locations file from geonames.org and building the Lucene index. It also requires a symbolic link with a specific name for the index directory. A nice Vagrant script can be found in the CLIFF-up repository, but it requires setting up a VM. As my goal is to avoid IaaS solutions at all costs (who wants to define network cards, set up public IPs and worry about scaling up or out when the time comes?), I had to find a PaaS solution. Docker to the rescue.

Luckily, some really nice people created Docker containers for both projects:

For CLAVIN – eliotjordan/docker-clavin

For CLIFF – havlicek/cliff-docker

Now, while there’s Docker support for Azure, it still requires a VM and some configuration effort.

Azure web app on Linux (preview)

Microsoft released the ability to deploy web apps running directly on Linux. And although it is still in preview and has some limitations (see below), it opens up a whole new world of opportunities.

When creating your web app, you can choose which runtime stack to use: Node.js, PHP, .NET Core, or Ruby. Applications can be deployed in various ways, and scaling options are built in.

The real power of the new feature is that, instead of choosing a runtime stack, you can choose a Docker container from either a public or a private repository and have Azure install it for you automatically.

Important note: The web app exposes the container on port 80. To map port 80 to a different port exposed by the container (usually done with the ‘docker run -p’ command), you need to add a PORT configuration parameter whose value is the port exposed by the container. For CLIFF, that port is 8080.
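As a rough mental model only, the PORT setting plays the role of the container side of a ‘docker run -p’ mapping. The small Python sketch below renders the command you would run by hand for a given set of app settings. The helper name and the image name are placeholders, and the idea that Azure does exactly this behind the scenes is an assumption for illustration, not documented behavior.

```python
def equivalent_docker_run(image, app_settings, public_port=80):
    """Build the docker command that mirrors the web app's port mapping.

    Hypothetical helper: it only illustrates how PORT relates to `-p`.
    """
    container_port = int(app_settings["PORT"])
    return ["docker", "run", "-p", f"{public_port}:{container_port}", image]


if __name__ == "__main__":
    # "cliff-image" is a placeholder image name, not a real repository.
    print(" ".join(equivalent_docker_run("cliff-image", {"PORT": "8080"})))
```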

You can also explore your container by using ‘bash’ from the Kudu console of your web app. You can access it directly at
https://YOUR_WEB_APP_NAME.scm.azurewebsites.net/DebugConsole/Default.cshtml

So, after specifying the cliff-docker container while setting up my Linux web app, all I had to do was wait for it to be ready. And a lot of waiting it was. The container itself weighs about 900 MB, and after the index is created it is about 7.2 GB in size. While Azure downloads all the different components, extracts them and builds the index, there is nothing to do, and sadly there is also no indication that anything is going on. It is also important to note that CLIFF is memory hungry: a minimum of 4 GB of RAM is required to hold the index in memory and achieve good query performance.

Only after a few long hours did I get an indication that it was working properly, when the Tomcat default web page came up at the root URL of the server.
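Since there is no progress indicator, one option is to let a small script do the waiting. The sketch below polls a URL until it answers with HTTP 200, using only the Python standard library. The function names, the five-minute interval and the number of attempts are arbitrary choices of mine, so adjust them to taste.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, timeout...
        return None


def wait_until_ready(url, fetch=http_status, attempts=60, delay=300,
                     sleep=time.sleep):
    """Poll url until it answers 200. Return True on success, False otherwise."""
    for attempt in range(attempts):
        if fetch(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling wait_until_ready("https://YOUR_WEB_APP_NAME.azurewebsites.net/") with the defaults checks every five minutes for roughly five hours. The fetch and sleep parameters exist only so the loop can be tried out without touching the network.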

To actually use the CLIFF API, use this address (note that it is CASE SENSITIVE):

https://YOUR_WEB_APP_NAME.azurewebsites.net/CLIFF-2.3.0/parse/text?q=Some clever text mentioning places like New Delhi, and people like Einstein.  Perhaps also we want to mention an organization like the United Nations?
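A browser will quietly encode the spaces and punctuation in that address, but when calling the API from code it is safer to percent-encode the text yourself. Here is a minimal Python sketch that builds the URL; the function name build_cliff_url is made up, and the path simply follows the address above.

```python
from urllib.parse import quote, urlencode


def build_cliff_url(app_name, text, version="2.3.0"):
    """Return the parse/text URL for a CLIFF web app, with text percent-encoded."""
    # The path is case sensitive, so keep "CLIFF" in upper case.
    base = f"https://{app_name}.azurewebsites.net/CLIFF-{version}/parse/text"
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode({"q": text}, quote_via=quote)
    return f"{base}?{query}"


if __name__ == "__main__":
    print(build_cliff_url("YOUR_WEB_APP_NAME", "places like New Delhi"))
```

For the sample sentence, the spaces become %20 and the commas become %2C.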

And here are the results:

{
  "results": {
    "organizations": [
      {
        "count": 1,
        "name": "United Nations"
      }
    ],
    "places": {
      "focus": {
        "cities": [
          {
            "id": 1261481,
            "lon": 77.22445,
            "name": "New Delhi",
            "score": 1,
            "countryGeoNameId": "1269750",
            "countryCode": "IN",
            "featureCode": "PPLC",
            "featureClass": "P",
            "stateCode": "07",
            "lat": 28.63576,
            "stateGeoNameId": "1273293",
            "population": 317797
          }
        ],
        "states": [
          {
            "id": 1273293,
            "lon": 77.1,
            "name": "National Capital Territory of Delhi",
            "score": 1,
            "countryGeoNameId": "1269750",
            "countryCode": "IN",
            "featureCode": "ADM1",
            "featureClass": "A",
            "stateCode": "07",
            "lat": 28.6667,
            "stateGeoNameId": "1273293",
            "population": 16787941
          }
        ],
        "countries": [
          {
            "id": 1269750,
            "lon": 79,
            "name": "Republic of India",
            "score": 1,
            "countryGeoNameId": "1269750",
            "countryCode": "IN",
            "featureCode": "PCLI",
            "featureClass": "A",
            "stateCode": "00",
            "lat": 22,
            "stateGeoNameId": "",
            "population": 1173108018
          }
        ]
      },
      "mentions": [
        {
          "id": 1261481,
          "lon": 77.22445,
          "source": {
            "charIndex": 40,
            "string": "New Delhi"
          },
          "name": "New Delhi",
          "countryGeoNameId": "1269750",
          "countryCode": "IN",
          "featureCode": "PPLC",
          "featureClass": "P",
          "stateCode": "07",
          "confidence": 1,
          "lat": 28.63576,
          "stateGeoNameId": "1273293",
          "population": 317797
        }
      ]
    },
    "people": [
      {
        "count": 1,
        "name": "Einstein"
      }
    ]
  },
  "status": "ok",
  "milliseconds": 80,
  "version": "2.3.0"
}
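The response is plain JSON, so pulling out the interesting bits takes only a few lines. The Python sketch below works on a trimmed copy of the response above; the summarize function and the shape of its return value are my own choices, not part of CLIFF.

```python
import json

# Trimmed copy of the response shown above.
SAMPLE = """
{
  "results": {
    "organizations": [{"count": 1, "name": "United Nations"}],
    "places": {
      "mentions": [
        {"id": 1261481, "name": "New Delhi", "countryCode": "IN",
         "lat": 28.63576, "lon": 77.22445,
         "source": {"charIndex": 40, "string": "New Delhi"}}
      ]
    },
    "people": [{"count": 1, "name": "Einstein"}]
  },
  "status": "ok"
}
"""


def summarize(response):
    """Collect entity names and place coordinates from a parsed CLIFF response."""
    if response.get("status") != "ok":
        raise ValueError(f"CLIFF returned status {response.get('status')!r}")
    results = response["results"]
    return {
        "people": [p["name"] for p in results.get("people", [])],
        "organizations": [o["name"] for o in results.get("organizations", [])],
        "places": [
            (m["name"], m["lat"], m["lon"])
            for m in results.get("places", {}).get("mentions", [])
        ],
    }


if __name__ == "__main__":
    print(summarize(json.loads(SAMPLE)))
```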

Limitations

Let’s start with the good part: while in preview, the app service plan required to host the web apps comes at a 50% discount.

The major limitations right now are that you must have a dedicated app service plan for Linux web apps, and that it must run in a resource group that does not include non-Linux web apps in the same region. (At the time of writing, the number of regions is also limited: West US, West Europe, and Southeast Asia.)

Other limitations can be found on the overview page.

Bonus – Python client for CLIFF

The creators of CLIFF also created a Python client for easy access to the API.
You can find the details on the package page.

 
