Maybe I Was Wrong about Java – Part 1
Editor’s Note: Being in a Java channel, most of us know the language very well and have been in its ecosystem for at least a couple of years. This gives us routine and expertise, but it also induces a certain amount of tunnel vision. In a new series, Outside-In Java, non-Javaists will give us their perspective on our ecosystem.
I’m not the biggest fan of Java. I can read Java if necessary, and I’ve written some in emergency situations, but I don’t make a habit of it. I have the same stereotyped impressions of it as many other non-Java people: It’s big and slow and exclusively written by people who wear ties while programming.
I’m perhaps best known for an elaborate deconstruction of PHP, but that happened because PHP sparks a unique frustration in me. Java mostly inspires, ah, boredom.
But wait. PHP is popular because it has a singular advantage: It removes a few of the steps from the nightmarish ritual that is web development. Java doesn’t have a comparable massive advantage, so it must be doing something right. Otherwise Oracle wouldn’t brag about the nine trillion devices running Java. It wouldn’t power the guts of a notable number of big websites. The dominant smartphone OS wouldn’t be built entirely atop it. I wouldn’t have a toaster that runs a JVM. (It works better than you’d think! You just have to wait a bit for it to warm up.)
So maybe — maybe — I have Java all wrong. Maybe I’ve been unfair to it. Maybe.
It might be time to re-examine my preconceived notions about Java. To bust some myths, if you will.
For reference, I’ve dabbled in a number of languages, but I mostly do Python and like many of its design decisions. Python has a little philosophical overlap with Java, so it’s an interesting point of comparison.
Java Is Slow
Java definitely feels big and slow to me. The word is like molasses in my brain. When I think of Java, I’m reminded of the very few encounters I’ve had with it on the desktop. Perhaps most infamous was Azureus, the Java BitTorrent client that everyone used to download Ubuntu releases over a decade ago. It was surprisingly sluggish considering that all it did was download files, and everyone jumped ship to µTorrent when it came out. µTorrent is written in C++, which supposedly makes it faster.
The thing is, this doesn’t make any sense. The only BitTorrent client I’ve used in years is Deluge, which is written in Python. (C)Python is definitely not as fast as Java by any measure, yet Deluge has always been snappy for me. The JVM is used in part because it’s pretty fast, so whence came this association with slowness?
As much as we’d like to think otherwise, it’s tough to make any real measurements here; so many factors affect speed that even benchmarking a single program against itself is unreliable, let alone comparing two completely different pieces of software. Did the machine get bogged down doing something else? Did the GC kick in at an inopportune time? Is one language better at a particular task? Am I using a bad JVM? Is one app better-written than the other? Even if I could control all of these factors, it’s still tricky to measure the perception of speed, which is what this is really about. The best I can do is think this through.
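As a small illustration of how noisy even a self-comparison can be, here is a minimal Python sketch that times one identical statement several times and reports how far the runs disagree. The function name and default arguments are made up for this example, and the numbers it prints will differ from machine to machine and from run to run, which is rather the point.

```python
import statistics
import timeit


def measure_spread(stmt="sum(range(10_000))", repeats=7, number=200):
    """Time the same statement several times and report how much the runs disagree."""
    # Each entry is the total wall-clock time for `number` executions of `stmt`.
    runs = timeit.repeat(stmt, repeat=repeats, number=number)
    fastest, slowest = min(runs), max(runs)
    return {
        "fastest": fastest,
        "slowest": slowest,
        "median": statistics.median(runs),
        # Relative gap between the best and worst run of identical code.
        "spread": (slowest - fastest) / fastest,
    }


if __name__ == "__main__":
    result = measure_spread()
    print(f"fastest {result['fastest']:.4f}s, slowest {result['slowest']:.4f}s, "
          f"spread {result['spread']:.1%}")
```

Any nonzero spread here comes from the machine, not the code, since the statement never changes.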
Slow Programs, Slow Language
And I suspect a little bit of confirmation bias here. If a program is slow, we blame the program — but if a Java program is slow, we blame Java. Java’s even at a particular disadvantage here; it’s easy to notice that a GUI program is written in Java, since Swing sticks out like a sore thumb, but harder to be sure that it’s written in C++ or similar.
In my experience this is especially visible in gaming circles, where you can find deep insights on game performance from people who’ve never written a line of code. Minecraft is slow because it’s written in Java, you see, whereas Starbound is slow because the developers didn’t “optimize” it. Whatever that means.
Another possible factor, though I don’t know how true this is: I get the impression that many developers write C++ because they want to, but write Java when they have to? I’ve encountered a lot of C++ code that has no compelling reason to be written in C++, but a lot of Java code with a story that starts “we wrote this with Rails and ported it when it got too slow”. (Or “it’s for Android so we had to”.) In other words, Java seems to be a common go-to for speeding up code that’s already too slow, so any given Java program is reasonably likely to be doing some very heavy lifting. It’s not that Java is slow; it’s that slow programs tend to be written in Java. A problem of self-selection.
Meanwhile, much C++ is written purely out of spite. Or, rather, there’s a subset of programmers who are obsessed with maximum performance and will reach for C++ whether or not they need it, because being against the “bare metal” gives them warm fuzzies and is totally worth the risk of spurious segfaults and memory errors. So a given C++ program isn’t necessarily likely to be doing the kind of intensive work that would justify using C++.
Is The JIT At Fault?
I was originally tempted to blame some of the perceived slowness on JIT warmup, but I’m not sure that’s likely. I’ve started experiencing JIT warmup more frequently myself, as I’ve been using PyPy more. PyPy is a JITted Python implementation (written in itself, hence the name), and one of the big barriers to wider adoption is that it takes a few seconds to warm up. Python programs that run more or less instantly with stock Python cause a noticeable delay when run with PyPy — and since Python is commonly used for command-line tools, that delay impacts quite a few programs.
But for anything that takes more than half a minute or so to run, PyPy is significantly faster. I would expect a JVM to have a much more advanced JIT than PyPy’s relatively young one, so surely the warmup time would be even shorter. I doubt Java would get a reputation for being slow just because it dawdles a bit in the first few seconds.
I’ve also watched JIT warmup happen in realtime with LÖVE, a game engine that uses LuaJIT. If you draw the framerate on the screen, you can watch it start around 10 and ramp up to 60 over the course of several seconds. After that, it runs fine. Again, the slowdown is fairly brief, and it’s never given me the impression that the engine or JIT as a whole is slow.
Java As A Toy
I’m willing to bet that this is largely gossip dating back to when Java first came out. Surely some C++ developers would’ve perceived Java as a reinvention of C++ that was comically slow… relative to C++. The original Sun JVM didn’t have a JIT for several years, so very early Java was bytecode interpreted like most Python is now — but on much slower hardware.
It makes some sense from a cultural perspective, too. Java was presented as a “serious” language — i.e., a competitor to C++ — at a time when the only “serious” language was, er, C++. Anything that even smelled like an interpreted language was largely regarded as a toy. Even more so than now, I mean. If the entire spectrum of “languages anyone would use for serious development” ranged from C++ to Java, of course Java would look like a lumbering beast. Automatic garbage collection? What are you, some kind of infant? I bet your language can’t even segfault.
Java Is Bloated
I don’t like the word “bloat”. It doesn’t mean anything. Or, rather, it means “stuff I don’t care about”, which isn’t a useful criticism.
I once had someone tell me simultaneously that ⓐ they didn’t use GIMP for artwork because it’s bloated and ⓑ they were frustrated with the much simpler Paint.NET because it lacked an obscure power feature they needed… which GIMP has.
In the interest of precision, here are some things “bloat” might be intended to mean.
Java Programs Use a Lot Of Memory
The build system ran Closure as java -Xmx1024M ..., which set the maximum heap size to 1 GB. When invoking Closure manually, I would sometimes forget the arcane-looking -Xmx switch, and the JVM would instantly go down in flames. Later I discovered the somewhat obscure _JAVA_OPTIONS environment variable, which saved me from ever having to remember this again, but the mystery remained. When I eventually looked into it, I found a fascinating clash of intentions.
We used shared dev machines, each with 64 GB of RAM. We had a ulimit -v of 5 GB, which prevented any process from allocating any more than that. By default, our JVM tried to allocate a quarter of the machine’s physical RAM at startup: 16 GB. The allocation failed, so the JVM shut down immediately. (The documentation did suggest the heap should be capped at no more than 1 GB by default, but it looks to be incorrect.)
This isn’t an entirely unreasonable thing for the JVM to do. Linux will happily dole out large allocations it doesn’t actually have, and as long as processes don’t try to use all of that memory, everything works fine. The JVM took advantage of this to allocate a huge block upfront, avoiding the need for a lot of (slow) allocations later. The ulimit was a rough heuristic to prevent runaway processes from bogging down the machine, but it limited a number that doesn’t actually mean anything. (Alas, ulimit -m, which is supposed to limit actual memory usage, has been a no-op since Linux 2.4.)
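The arithmetic behind the clash is simple enough to sketch. The snippet below is only a model of the story as told here: the one-quarter fraction is the behavior described above for that particular JVM, not a claim about every JVM version, and the function names are invented for illustration.

```python
GIB = 1024 ** 3


def default_heap_request(physical_bytes, fraction=0.25):
    """Heap the JVM in the story asked for: a fixed fraction of physical RAM."""
    return int(physical_bytes * fraction)


def fits_under_limit(request_bytes, address_space_limit_bytes):
    """True if a reservation of this size stays within a `ulimit -v` style cap."""
    return request_bytes <= address_space_limit_bytes


if __name__ == "__main__":
    request = default_heap_request(64 * GIB)
    print(request // GIB, "GiB requested by default")
    print("fits under a 5 GiB cap:", fits_under_limit(request, 5 * GIB))
    # With an explicit 1 GiB maximum heap (the -Xmx1024M case), the request fits.
    print("1 GiB heap fits:", fits_under_limit(1 * GIB, 5 * GIB))
```

The cap applies to address space rather than to memory actually touched, which is why a reservation that would never have been fully used still fails outright.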
In my head, this story is “that one time Java wanted 16 GB”. Unfair, I know, but it resonates with a vague sense that Java is generally known as a memory hog. Even Java itself seems to think it might need that much memory. Yet if I look past the stereotype, I can’t come up with a concrete offhand reason why that would be the case.
I can imagine that most of the speed factors apply here as well: Java gets blamed when a Java program is big, but C++ doesn’t get blamed when a C++ program is big; big tasks tend to end up written in Java; C++ developers who hyperoptimize may scoff at garbage collection and per-object overhead.
Ah, but while speed is difficult to quantify even for a single program, size is much more deterministic. The same value of the same type will generally take the same amount of space from run to run. I know that a CPython object has 16 bytes of overhead, for example — a type pointer and a refcount. I’m less confident about object overhead in Java — exact memory usage is treated as an implementation detail of the VM, and Java has several competing VMs in common use. (I only know about CPython from reading the source code, which I’m already familiar with.) But I seriously doubt any JVM would need more object overhead than Python, and in some cases it should need much less — Python has no unboxed primitives. Ultimately, Java shouldn’t be much worse than Python, and I don’t think of Python as a memory hog, so surely I shouldn’t think of Java as one.
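The CPython figure is easy to check from inside the interpreter. This is a minimal sketch, assuming a standard CPython build (other implementations, such as PyPy, may not report sizes the same way); the helper names are made up for illustration. On a typical 64-bit build the bare object should come out at 16 bytes, two pointer-sized fields, and even a small integer is a full heap object rather than an unboxed primitive.

```python
import struct
import sys

POINTER_SIZE = struct.calcsize("P")  # 8 on a typical 64-bit build


def bare_object_size():
    """Size of an object() with no payload: just the per-object header."""
    return sys.getsizeof(object())


def boxed_int_size(value=1):
    """Size of a small int, which CPython stores as a full heap object."""
    return sys.getsizeof(value)


if __name__ == "__main__":
    print("pointer size:", POINTER_SIZE)
    print("bare object:", bare_object_size())
    print("boxed int:", boxed_int_size())
```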
Enough speculation. If size is more reliably measurable, let’s do some measuring. Here’s a terrible, unscientific comparison of the initial memory usage of a few arbitrarily-chosen desktop programs.
- XMind (Java), a mind mapper: 416 MB
- yEd (Java), a graph editor (as in flowcharts, not y = x²): 372 MB
- jDiskReport (Java), a disk space report thing: 228 MB
- muCommander (Java), an MC-style file browser: 183 MB
- FreeMind (Java), another mind mapper: 181 MB
- SLADE (C++/wxWidgets), a Doom map editor: 93 MB
- GIMP (C/GTK), an image editor: 88 MB
- LMMS (C++/Qt), a digital audio workstation: 75 MB
- Deluge (Python/GTK), a BitTorrent client: 85 MB
(Incidentally, all of the Java programs start with at least 11 GB virtual, except FreeMind, which comes with a shell script that adds -Xmx. Even a trivial hello-world Java program allocates 10.5 GB for me, so it looks like the default initial heap size is still comically large. I have 32 GB of physical memory, so a quarter would be 8 GB; I don’t know where values of 10.5+ GB are coming from.)
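For anyone wanting to reproduce the distinction between virtual and resident memory, here is one Linux-specific way to read both numbers for a process. It is only a sketch and not necessarily how the figures above were collected; the function names are invented, and it relies on the VmSize and VmRSS fields that Linux exposes in /proc/<pid>/status.

```python
from pathlib import Path


def parse_status(text):
    """Pull VmSize (virtual) and VmRSS (resident) out of /proc/<pid>/status text, in kB."""
    wanted = {"VmSize", "VmRSS"}
    found = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            # Lines look like "VmRSS:     12345 kB".
            found[key] = int(rest.split()[0])
    return found


def own_memory():
    """Read this process's numbers; returns {} where /proc is unavailable."""
    status = Path("/proc/self/status")
    if not status.exists():
        return {}
    return parse_status(status.read_text())


if __name__ == "__main__":
    print(own_memory())
```

VmSize is the number a ulimit -v cap would see; VmRSS is closer to what a desktop memory monitor reports.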
Wow. I didn’t expect such a vast difference, even knowing that a lot of factors are working against Java here. GTK and Qt are shared libraries that might be shared with other processes, and even a Python UI is wrapping those libraries and storing much of its data at the C level; Swing lives almost entirely in Java land and was only in use by the program I had open. Java of course has a whole runtime (which seems to take a baseline of 26 MB), whereas the C++ “runtime” is microscopic. Python has a runtime as well, but Deluge is a much simpler program than the others; it’s just the only other not-C++ thing I have on hand. Or XMind and yEd might just be unrepresentative for some other reason.
Java (like Python) also needs to remember a lot of debugging information that’s generally left out of released C++ programs, for the sake of reflection and stack traces. Garbage collection tends to trade off spare memory for speed; I don’t know in detail how my JVM’s GC works, but I’ve seen GCs with 100% overhead or more, so I wouldn’t be surprised if that were a significant factor here. I want to say GC also has issues with fragmentation that are difficult to solve.
Java’s Strings Are Wasteful
I suspect another huge culprit is Java’s String type. A Java char is two bytes, which looked like enough to hold any Unicode character when Java was designed. Until Unicode went “oops” and promised that it would never have more than 1,114,112 characters, so three bytes would be enough to represent anything, but no one has a three-byte integer type, so let’s round it up to four. Now Java is in a very awkward position. ASCII text takes twice as much space as it actually needs. But a Java String can’t represent all Unicode characters without another encoding mechanism, so it’s still just as possible to write buggy text-handling code that assumes each char is an actual character. The worst of both worlds.
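Both halves of that complaint can be seen by encoding a few strings as UTF-16. This is a Python sketch rather than Java (Python being the comparison language throughout), and the helper names are made up; the count of 16-bit code units is what a Java String reports as its length.

```python
def utf16_bytes(text):
    """Bytes needed to store the text as UTF-16 (no byte-order mark)."""
    return len(text.encode("utf-16-le"))


def utf16_code_units(text):
    """Number of 16-bit code units, i.e. what Java's String.length() counts."""
    return utf16_bytes(text) // 2


if __name__ == "__main__":
    for sample in ["hello", "h\u00e9llo", "\U0001F600"]:
        print(repr(sample),
              "code points:", len(sample),
              "UTF-16 units:", utf16_code_units(sample),
              "UTF-16 bytes:", utf16_bytes(sample),
              "UTF-8 bytes:", len(sample.encode("utf-8")))
```

Plain ASCII doubles in size, while a single character outside the Basic Multilingual Plane takes two code units, so code that treats each unit as a character miscounts it.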
Strings could be a significant chunk of a GUI’s memory usage, what with the plethora of labels and tooltips, in which case this would be making a huge difference. I know XML and configuration files are fairly popular in the world of Java, too, and those would naturally produce tons of strings. I see that Java 9 is set to adopt a more compact scheme, where ASCII strings will transparently use only one byte per character under the hood. Python switched to a similar scheme several releases ago, and the memory usage of a Django (web) application dropped by a whopping 40% relative to a two-byte build. Java might see comparable improvements in text-heavy programs; a report on the impact on a server benchmark found a 21% improvement in heap usage.
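Python's version of the compact scheme can be observed directly. The sketch below assumes CPython 3.3 or later, where a string is stored with one, two, or four bytes per character depending on the widest character it contains; other implementations store text differently. The function name is invented for this example.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two strings of different lengths."""
    # Subtracting cancels the fixed header, leaving only the per-character cost.
    return (sys.getsizeof(ch * 2 * n) - sys.getsizeof(ch * n)) / n


if __name__ == "__main__":
    for label, ch in [("ASCII", "a"), ("Greek", "\u03c0"), ("emoji", "\U0001F600")]:
        print(label, bytes_per_char(ch))
```

The catch in this design is that one wide character widens the whole string, so the savings depend heavily on how much of a program's text is plain ASCII.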
I went into this expecting to find that Java was far more compact than I thought, and… that didn’t quite happen. Even 228 MB isn’t too bad, but 416 MB is a little ridiculous. Oh, I accidentally left yEd running overnight; it grew to 854 MB despite doing nothing, putting it in third place behind an art program with several large canvases open and my browser with several hundred tabs open. I don’t know what it’s doing or who to blame, but this doesn’t look good. It is interesting that FreeMind and XMind were the best and worst, respectively, despite being the same genre of software and written in the same language.
Now, to be a little fair to Java, I honestly wouldn’t have noticed the memory usage if I hadn’t gone looking for it. I do have 32 GB of RAM for a reason. And yet… my entirely Java-based phone barely has 2 GB, and it seems to manage. Maybe I’m missing something here. The spread for Java programs is conspicuously much wider than for C++ programs, too, and I don’t know why.
Java Programs Have a Large Filesystem Footprint
I feel myself nodding slightly in noncommittal agreement with this, while I look around the room to see if anyone else thinks it’s true.
The thing is, I have no idea why this sounds reasonable to me.
The Arch Linux package repository is convenient here, since it lists the installed size for every package. My old friend Closure Compiler is 19.7 MB. Wow! That’s a lot. Right? Well, maybe not. The most similar project I can think of is Babel, which compiles the latest ECMAScript to something more browser-friendly, and it’s still 7.0 MB. Closure Compiler can do much of what Babel does, plus it’s built more like a traditional compiler, so I’m not surprised it’s bigger; if anything, I’m surprised it’s less than three times as big. GCC, meanwhile, is 116.1 MB.
This is a wildly unreliable and unscientific way to measure the size of software. But it’s enough that I’m already having serious doubts about this point. It’s tempting to compare some of the desktop software from earlier, but even FreeMind and XMind — two Java programs that serve roughly the same purpose — have very different installed sizes of 27 MB and 138 MB, respectively.
I can’t even think of an anecdote that would explain why I’d expect Java software to be big. I wonder if it’s because Java applets were relatively large back when they were popular — when a lot of people were still using dial-up.
Before I stop picking on Closure Compiler, I do have one other interpretation of “filesystem footprint”: the arrangement of source files. The bulk of Closure’s source code lives in a directory five levels deep, which seems slightly extreme. That directory contains 347 .java files, which also seems slightly extreme. A lot of these files look like they could be grouped together into sensible categories, but they… aren’t.
Maybe Closure is an exception. GitHub tells me the most-starred Java project is elasticsearch, the actual code of which is also buried five levels down. I don’t see any directories with quite so many files here, but each directory seems to contain ten more, and I’ve yet to find a bottom.
Browsing the source code for either of these (admittedly large) projects seems daunting. There’s a certain amount of sprawl here. Most Python projects, in contrast, are relatively compact: a few levels deep, half a dozen files wide.
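Sprawl of this kind is easy to quantify for any checkout. Here is a rough sketch that walks a source tree and reports which directory holds the most files with a given suffix, along with how deep it sits. The function name is hypothetical and the depth is counted relative to whatever root is passed in.

```python
import os


def busiest_directory(root, suffix=".java"):
    """Return (path, count, depth) for the directory holding the most `suffix` files."""
    best = (root, 0, 0)
    for dirpath, _dirnames, filenames in os.walk(root):
        count = sum(1 for name in filenames if name.endswith(suffix))
        if count > best[1]:
            rel = os.path.relpath(dirpath, root)
            # The root itself is depth 0; each nested directory adds one level.
            depth = 0 if rel == "." else len(rel.split(os.sep))
            best = (dirpath, count, depth)
    return best


if __name__ == "__main__":
    path, count, depth = busiest_directory(".")
    print(f"{count} files, {depth} levels down: {path}")
```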
I suspect this boils down to a very significant property of Java: that a single file can only contain a single (public) class. As I understand it, Java has no notion of exports; if you have a file foo/Bar.java, the only thing in it that can be imported is a public class called Bar. No matter how you want to arrange your code, how big or small your classes are or how many of them you have, you need a separate file for every single public class. The rules are different, but this reminds me a bit of Perl, in a way. Wow, that’s a weird thing to say.
Contrast this with Python, which has a fairly similar-looking package hierarchy. The file foo/bar.py can be imported as the dotted path foo.bar, and its contents can be referred to as attributes of foo.bar. That foo.bar will be an intermediate namespace, a module object, which contains everything defined in the file. The imports look similar to Java’s, but the files are freeform. A single file can contain a lot of small classes, free functions, public plumbing, and whatever else you want. Python files can contain entire ideas; Java files are shackled to whatever you think a “class” ought to be.
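To make the Python side concrete, here is a self-contained sketch that writes a throwaway foo/bar.py to a temporary directory, imports it, and lists what the resulting module object contains. The classes and function inside the file are hypothetical stand-ins; only the foo/bar.py layout comes from the text above.

```python
import importlib
import sys
import tempfile
import types
from pathlib import Path

# Hypothetical contents for foo/bar.py: several unrelated things in one file.
BAR_SOURCE = '''
class Small:
    pass

class AlsoSmall:
    pass

def helper():
    return "free function"
'''


def load_demo_module():
    """Create a throwaway foo/bar.py on disk and import it as foo.bar."""
    with tempfile.TemporaryDirectory() as tmp:
        package = Path(tmp) / "foo"
        package.mkdir()
        (package / "__init__.py").write_text("")
        (package / "bar.py").write_text(BAR_SOURCE)
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            return importlib.import_module("foo.bar")
        finally:
            sys.path.remove(tmp)


if __name__ == "__main__":
    bar = load_demo_module()
    print(isinstance(bar, types.ModuleType))
    print(sorted(name for name in vars(bar) if not name.startswith("_")))
```

Two classes and a free function live in one file, and the importing code sees them all as attributes of a single module object.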
I have a hard time imagining working on a large project with this one-class-per-file limitation. Creating a new class (rather than tacking onto an existing one) already involves some friction; needing to create a whole new file would be a huge disincentive. I’d probably want to have as few classes as possible, which might be why I faintly associate Java with big hairy classes. More stuff per class means fewer classes overall — problem solved!
Curiously, Java does have a way to emulate Python-like modules, but I’ve never seen or heard of its being used. import static allows directly importing nested classes, so a file could contain a big public class with any number of static public classes inside it. The top-level class would be a dumb container, similar to a Python module, and any combination of its static children could be imported by any other code. Alas, if this is as uncommon as it seems to be, the resulting confusion among other Java developers would probably negate the benefits.
Looking at elasticsearch again, the package structure isn’t too different from how I might arrange a Python project. Consider this analysis directory, where many of the files are boilerplate classes no longer than the license at the top of the file. This is a single package with a lot of small classes; they’re only spread across multiple files because the language requires it. Interpreted that way, elasticsearch is a fairly tidy project. I even found a few directories with only a single file/class in them, which is exactly what I’d end up with if I emulated Python modules with package directories.
The file sprawl seems slightly annoying from a development perspective, but it’s intertwined with a fairly fundamental law of the language. Sorry, Java; you lose this round.
Java Has a Large Standard Library
Is this an actual complaint anyone makes? I’m not sure. It parallels the use of “bloated” for end user software, so it’s worth a look.
Standard libraries are a tricky tradeoff. C’s paltry standard library means that many projects reinvent the same wheels, but keeps the language itself simpler and more portable. On the other hand, Python’s massive standard library is very convenient, but has accumulated loads of weird junk like over a dozen IRIX modules. I prefer a standard library that augments the core language well, though I’m increasingly finding that a solid package manager can be much nicer than having the kitchen sink shipped with the language itself.
I can’t think of any good way to measure the size of a standard library besides size on disk. Here are the sizes for some language runtimes I happen to have installed on my machine, measured with zsh globs. My gut feeling is that Java, .NET, and Python have fairly large standard libraries; Rust and Ruby have more moderate ones; and Perl is fairly bare-bones. C is nothing.
- Mono 4.6: 43 MB
- OpenJDK 7: 33 MB
- CPython 3.5: 27 MB
- Perl 5.24: 17 MB
- Ruby 2.3: 8.9 MB
- Rust 1.12: 4.9 MB
- C: 2.0 MB
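One rough way to get size-on-disk numbers like these is to add up the files under a runtime's library directory. The sketch below is a stand-in for that kind of measurement, not the method used for the list above; the function names are invented, and which directory counts as "the standard library" for each runtime is left to the caller.

```python
import os


def tree_size(root):
    """Total size in bytes of the regular files under `root`, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def in_megabytes(size_bytes):
    """Convert a byte count to megabytes (1 MB taken as 1024 * 1024 bytes)."""
    return size_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Measures the directory holding the running interpreter's own os module.
    stdlib_dir = os.path.dirname(os.__file__)
    print(f"{stdlib_dir}: {in_megabytes(tree_size(stdlib_dir)):.1f} MB")
```

This counts apparent file sizes, so it ignores filesystem block rounding and will not match a package manager's installed-size figure exactly.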
Well, er, hm. Color me slightly surprised.
Admittedly, this is a terrible comparison. Rust’s standard library is compiled right down to machine code (making it smaller). Mono’s and OpenJDK’s are, I assume, compiled to bytecode. The CPython, Perl, and Ruby libraries are uncompressed source code (making them larger), but exclude a few binary components (making them smaller), but include a great deal of inline documentation (making them larger), but are written in higher-level languages (making them smaller).
The position of Perl makes me doubt this methodology all the more. All I can really conclude is that Java’s standard library is somewhat bigger than Python’s, which seems right — Java ships with twice as many GUI packages as Python, for example. I don’t see any strong signs here that Java’s standard library is unmanageably large.
The Score So Far
So far, the reputation for slowness looks like mostly old gossip, and neither Java’s disk footprint nor its standard library seems unusually large. On the other hand, Java may be more memory-hungry than I expected. I don’t have a satisfying explanation for why this is the case or why it seems so much worse on my desktop than my phone. It’s possible that a couple of my scant few data points are just bad. It’s also possible that Java really is a memory hog, that’s why not much public desktop software is written in Java, and I never notice on my phone because I don’t multitask nearly as much.
I’m a programmer, but I’m not a Java programmer, so most of this has been from an end user perspective — where several other factors contribute to an unfair negative perception of Java. Java only reminds desktop users of itself when it’s doing something annoying, like running a quickstart service on startup (is that still a thing?) or bugging about updates. I mentioned briefly that Swing sticks out like a sore thumb, and the programs I tried have cemented that impression: They use the wrong colors and fonts, they default to Java’s own subpixel font rendering which bleeds badly, and they do bizarre things like emulate the Windows 95 file chooser dialog on Linux. I groan every time I encounter a Swing program, because I know it’ll be awkward and clunky — and since the program is so obviously written in Java, I associate “clunky” with Java as a whole.
But Java has a small and shrinking presence on the desktop and is disappearing from browsers, so perhaps the end user perspective doesn’t matter so much. Java is still preposterously popular server-side and almost exclusively powers Android, and there must be a good reason for that. I remain skeptical but curious. I’ll have to look more into what Java is like from a development standpoint in part two.