Threads vs. Processes in Java & Co.

Friday, April 11, 2008 8:13 AM — 16 comments

Erik Engbrecht posted a very interesting article on Multiprocess versus Multithreaded System design. It contains a lot of insight; here is just one quote:

On the one hand, on most operating systems threads involve less overhead than processes, so it is more efficient to use multiple threads than multiple processes. On the other hand, multiple processes ultimately will give you better reliability because they can be spawned and killed independently from one another.

[...]

My rule of thumb is to look at the amount of shared data or messaging required between concurrent execution paths and balance against how long the "process" (not OS process) is expected to live.

I think one important point is missing here. It's the question of what your underlying framework is, and how you access it.

Early on, CGI applications written in C were an acceptable solution, as they mostly used the underlying UNIX system as their framework and API. Spawning a simple process that more or less only accesses the basic C library for a short-lived transaction is fast and safe.

But eventually applications get more and more complex, and you will have more and more shared functionality that is accessed in a library fashion. You will want to write this in the same language and environment as your application, as this makes debugging, lifecycle management, and so on much easier.

As these libraries start to get more complex, their initialization will start to take significant time and resources. And at some point, you will no longer be able to load them every time a request hits your CGI program.

So now you can either put those libraries into some kind of daemon that is accessed via IPC, or you have to create a long-running server process that handles multiple requests, either sequentially (FCGI) or even truly multithreaded, like today's Java servers.
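To make that contrast concrete, here is a rough sketch in plain Java (all names are made up for illustration): the expensive library gets initialized once when the server process starts, and a pool of threads then serves requests against it. That one-time setup is exactly the cost a CGI-style design would have to pay again for every single request.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class LongRunningServer {

    // Stand-in for any framework that is slow to set up but cheap to call.
    static class ExpensiveLibrary {
        ExpensiveLibrary() {
            // imagine parsing configuration, warming caches, loading classes here
        }
        String handle(String request) {
            return "handled: " + request;
        }
    }

    // Initialized exactly once for the lifetime of the server process;
    // a CGI program would pay this cost again on every single request.
    private static final ExpensiveLibrary LIBRARY = new ExpensiveLibrary();

    private final ExecutorService workers = Executors.newFixedThreadPool(32);

    // Each request runs on a pooled thread; all threads share LIBRARY.
    public void dispatch(final String request) {
        workers.submit(new Runnable() {
            public void run() {
                System.out.println(LIBRARY.handle(request));
            }
        });
    }

    public static void main(String[] args) {
        LongRunningServer server = new LongRunningServer();
        server.dispatch("GET /index.html");
        server.dispatch("GET /about.html");
        server.workers.shutdown();
    }
}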

It's also not just the cost of initializing these libraries; it's also their footprint. As I've written before, running a herd of mongrels can get really expensive, memory-wise.

MVM a solution?

That being said, losing the simplicity and reliability of processes is indeed highly problematic. I'm not sure about the state of Sun's Multitasking Virtual Machine (MVM) project, which was designed to solve some of these issues.

Maybe this problem could be overcome if our programming languages allowed us to define application parts as effectively stateless, ambient libraries that execute more or less purely functionally (well, except for logging and such...) once they are initialized.

If you could define such constraints on program parts, you could have your cake and eat it, too. Multiple instances of your application could share a possibly very large part of their code base, but you could still kill single instances at will, as they don't share state with other instances.
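To illustrate what I have in mind, here is a sketch in plain Java, with made-up names and no real enforcement: the shared "ambient library" has no mutable fields after construction, while each application instance keeps its own private state. The @EffectivelyStateless marker is purely hypothetical; today nothing checks it, which is exactly the gap discussed below.

import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

public class StatelessSketch {

    // Purely hypothetical marker: "no mutable state after initialization".
    // Nothing verifies this today; the point is that a compiler or VM could.
    @Target(ElementType.TYPE)
    @interface EffectivelyStateless {}

    // The shared "ambient library": expensive to build, but afterwards it only
    // does (more or less) pure computation, so every instance may use it.
    @EffectivelyStateless
    static final class TemplateEngine {
        private final String header; // immutable after construction

        TemplateEngine(String header) {
            this.header = header;
        }

        String render(String body) {
            return header + "\n" + body;
        }
    }

    // Per-instance state lives here and is never shared, so one instance can
    // be killed at will without corrupting any other instance.
    static final class AppInstance {
        private final TemplateEngine shared;
        private int requestCount; // instance-local, mutable state

        AppInstance(TemplateEngine shared) {
            this.shared = shared;
        }

        String handle(String body) {
            requestCount++;
            return shared.render(body + " (request #" + requestCount + ")");
        }
    }

    public static void main(String[] args) {
        TemplateEngine engine = new TemplateEngine("== shared engine ==");
        AppInstance a = new AppInstance(engine);
        AppInstance b = new AppInstance(engine); // shares code and engine, not state
        System.out.println(a.handle("from a"));
        System.out.println(b.handle("from b"));
    }
}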

The problem is with the "more or less" part of "purely functional". I'm not sure if it's actually possible to find that niche where you can still have some state. And of course you would have to be able to statically prove those constraints, or else everything is moot.

Also highly interesting in this context is Microsoft's Singularity, which achieves a runtime that doesn't need OS processes but still has all the nice properties of processes (isolation, 'kill -9'), due to some carefully chosen, statically verifiable constraints.


Commenting is disabled.

Nice post! I just want to add something I recently came across: The webserver Yaws, written in Erlang:

http://yaws.hyber.org/

It specifically aims at solving these problems, i.e. being able to scale up to many concurrent requests, because the Erlang language itself is designed for that. Most importantly, it doesn't use the threads or processes of the underlying operating system. I have no experience with it myself, but this article, which shows that Yaws can handle 80,000 concurrent HTTP requests while the Apache web server simply dies at around 4,000, makes Yaws look very impressive:

http://www.sics.se/~joe/apachevsyaws.html

Maybe the definition of "stateless code parts" that you're asking for can be found in Erlang.


Thanks Alex.

I'm aware of Erlang and YAWS. And indeed, the "stateless code parts" pattern is very dominant in Erlang, which is essentially a purely functional language. State in Erlang is carried mainly through tail recursion. Outside of those recursive loops, all code is by definition stateless and thus easy to share.

Erlang also doesn't have thread locking/synchronization, so you can always easily kill single actors.

So Erlang would indeed be a candidate, but the language itself is IMHO really bad. Nothing against functional programming, but useful programs do have state, and dealing with that state is really painful in Erlang. Plus, it's totally antiquated, with strings being represented as lists of integers and such...

I'd hope to port the benefits of certain Erlang properties & idioms to a more mainstream language and runtime like Java.
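Just to sketch what porting that idiom could look like (plain Java, made-up names, and no claim that this is as cheap as Erlang's lightweight processes): the actor's only state lives in a local variable of its receive loop, roughly what Erlang does with tail recursion, and the actor can be stopped on its own without touching any shared memory.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class CounterActor implements Runnable {

    private final BlockingQueue<String> mailbox = new LinkedBlockingQueue<String>();

    public void send(String message) {
        mailbox.offer(message);
    }

    public void run() {
        int count = 0; // the only state, confined to this loop
        try {
            while (true) {
                String msg = mailbox.take();
                if ("stop".equals(msg)) {
                    return; // this actor dies; nothing else is affected
                }
                count++;
                System.out.println("received " + msg + ", count=" + count);
            }
        } catch (InterruptedException e) {
            // being interrupted is the other way to "kill" this actor
        }
    }

    public static void main(String[] args) {
        CounterActor actor = new CounterActor();
        new Thread(actor).start();
        actor.send("hello");
        actor.send("world");
        actor.send("stop");
    }
}

The obvious gap is cost: Erlang schedules huge numbers of lightweight processes, while this sketch burns a whole OS thread per actor.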


> I'd hope to port the benefits of certain Erlang properties & idioms to a more mainstream language and runtime like Java.

Maybe Scala is the right language for that kind of experiment, even if it's not (yet) mainstream.


If you're already in the Java world, Terracotta can provide some interesting possibilities: you write normal Java yet spread your work across processes or threads as you desire. Terracotta by itself doesn't handle the notion of spawning/killing processes, but it does support all of the shared memory / coordination aspects.

I've actually built a distributed testing framework that ties together Groovy scripted agents with Terracotta and some Java code to manage testing agents and so forth, which worked really well. (Soon to be open-sourced and publicized.)

Yes, I work for Terracotta.


@Alex: To me, Terracotta appears to go the other way around: instantiate the application software for each node (or process, the unit of failure anyway), but then distribute the instances of the domain model objects over a set of nodes. Which I think is very worthwhile, but still somehow the other way around ;-)

I should read up on Terracotta a bit, I'm interested in how you guys handle application failure (and also the whole thing seems to be very cool in general).

@mbien: I've been there. But Scala doesn't really solve the problems that Erik's article and mine were about. The actors implementation there is cool, but I'm still sceptical about whether Scala adds too much complexity for the benefits it brings.