Java: Unescaping HTML entities, or Don’t use libraries

Some time ago I needed to write some Android code that converts HTML entities to the characters they represent. In case you don’t know, HTML entities are a way of encoding characters where you can write “&” as either “&amp;” (by name) or “&#38;” (by character code). This is a well-documented and widely used conversion, and I figured it should already be available somewhere. And it is: the Apache Commons Lang jar has it. Normally I don’t trust 3rd-party libraries, but it’s Apache. It just has to be super optimal and rock solid all around.

So I added this jar and started writing other code. Luckily, I was running performance tests on it. At some point I noticed that performance had dropped very significantly. I started looking and found that, even though I barely had any entities in my input, the Apache code took a long time to run.

The library is open source, so I looked into what it does. It was quite nice code, but it also did a lot of unnecessary work. Not the kind of work that handles rare cases, but the kind that makes code modular, easy to follow, and generally smart-looking. It was not very efficient, though. So I spent some time rewriting the unescape function in a stupid, straightforward way. Then I ran this test:

String s = "This is a test string &amp; it has an entity";

Log.i("test", "start test 1");
long time1 = System.currentTimeMillis();

for (int i = 0; i < 10000; i++) {
    String s1 = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml3(s);
}

Log.i("test", "start test 2");
long time2 = System.currentTimeMillis();

for (int i = 0; i < 10000; i++) {
    String s1 = StringUtils.unescapeHtml3(s);
}

Log.i("test", "end test 2");
long time3 = System.currentTimeMillis();
Log.i("test", "time 1: " + (time2 - time1) + " time 2: " + (time3 - time2));

Here are the results (times in milliseconds for 10000 calls each); they are rather interesting:

"This is a test string &amp; it has an entity"
time 1: 3421 time 2: 80
"This is a test string - it has no entity"
time 1: 3767 time 2: 5

Now, I’m not claiming my code is as good as Apache’s. My version may not handle some extremely rare corner cases. But so far it has been tested with very rapid calls in a large project, with multiple languages, and there have been no issues.

This is a great example of why I distrust 3rd-party libraries in general. It’s fine if they really do something serious. But free utility libraries doing small tasks are usually not worth it. They just make simple things look complicated and important. Take this entity conversion. It’s really nothing but a 2-screen function plus a data map. But what if you didn’t know that? You’d be stuck with a black box which, in the test above, took roughly 40 to 750 times longer than it should, and you’d be thinking “that’s just the amount this task takes, nothing to be done about it”. Which is, I suspect, exactly the case with lots of people using this very common library.
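To make the “function plus a data map” idea concrete, here is a minimal sketch of that shape. It is not the StringUtils.unescapeHtml3 code timed above: the class name SimpleHtmlUnescaper, the five-entry entity table, and the length cap are all assumptions made for illustration, and a real table would have to list every named entity the input can contain.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not the code benchmarked in this post.
final class SimpleHtmlUnescaper {

    // The "data map": a deliberately tiny table of named entities (assumed subset).
    private static final Map<String, Character> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", '&');
        NAMED.put("lt", '<');
        NAMED.put("gt", '>');
        NAMED.put("quot", '"');
        NAMED.put("nbsp", '\u00A0');
    }

    // Assumed cap on the text between '&' and ';' that we bother to look up.
    private static final int MAX_ENTITY_LENGTH = 10;

    private SimpleHtmlUnescaper() {}

    static String unescape(String input) {
        int amp = input.indexOf('&');
        if (amp < 0) {
            return input; // fast path: no '&' means nothing to do and nothing to allocate
        }
        StringBuilder out = new StringBuilder(input.length());
        int pos = 0;
        while (amp >= 0) {
            out.append(input, pos, amp); // copy the plain text before the '&'
            int semi = input.indexOf(';', amp + 1);
            int decoded = -1;
            if (semi > amp + 1 && semi - amp - 1 <= MAX_ENTITY_LENGTH) {
                decoded = decode(input.substring(amp + 1, semi));
            }
            if (decoded >= 0) {
                out.appendCodePoint(decoded);
                pos = semi + 1;
            } else {
                out.append('&'); // not a known entity: keep the '&' literally
                pos = amp + 1;
            }
            amp = input.indexOf('&', pos);
        }
        out.append(input, pos, input.length());
        return out.toString();
    }

    // Returns the decoded code point, or -1 if the body is not recognized.
    private static int decode(String body) {
        if (body.charAt(0) == '#') {
            try {
                boolean hex = body.length() > 1
                        && (body.charAt(1) == 'x' || body.charAt(1) == 'X');
                int cp = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                return Character.isValidCodePoint(cp) ? cp : -1;
            } catch (NumberFormatException e) {
                return -1;
            }
        }
        Character c = NAMED.get(body);
        return c == null ? -1 : c;
    }
}
```

The point of the sketch is the early return: a string without any “&” is handed back untouched, which matters when most inputs contain no entities at all. Unknown or malformed entities are passed through unchanged rather than rejected.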
