Martin Probst's weblog

Java Performance and Garbage Collection

February 26, 2005 at 13:23 #

I've never read much about Java performance tuning, which is something I'm going to change, partly because of a new job that will presumably require these skills and partly because it's just quite interesting to see what the people at Sun and other companies did under the hood of the JVM.

There are two general areas: optimizing your code and tuning the JVM parameters. I found really interesting information about GC tuning at Sun (they have a truly ugly stylesheet for that article, but Opera users can use the custom stylesheet ...).

More about memory management can be found in this article about Soft-, Weak- and PhantomReferences. I hadn't heard of these or of the package java.lang.ref at all before. It looks very interesting, especially for server applications that need caches.
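
A minimal sketch of why soft references are attractive for caches, assuming a simple key/value cache. The class name SoftCache is made up for illustration and does not come from the linked article. The JVM may clear a SoftReference when memory gets tight (and is guaranteed to clear all softly reachable objects before throwing an OutOfMemoryError), so a cache miss just means "recompute or reload".

```java
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache whose values the garbage collector may reclaim under memory pressure. */
final class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> map = new HashMap<>();

    /** Stores the value behind a soft reference, so the cache alone never keeps it alive. */
    synchronized void put(K key, V value) {
        map.put(key, new SoftReference<>(value));
    }

    /** Returns the cached value, or null if it was never stored or has been collected. */
    synchronized V get(K key) {
        SoftReference<V> ref = map.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            // The collector cleared the referent; drop the stale entry.
            map.remove(key);
        }
        return value;
    }
}
```

Callers have to treat a null result as a normal cache miss. A WeakReference would work the same way in code but is cleared much more eagerly, which usually makes it a poor fit for caching.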

So much for memory management. I'm still lacking information about performance programming dos and don'ts. Apart from general stuff programmers should know (virtual functions, allocation/deallocation, ...) I haven't read much about such things in Java. Can anyone recommend a good book about Java performance programming?


Taking the XML out of XML?

February 13, 2005 at 12:10 #

You could use regular expressions Lars ;-)

But seriously, why would you want to do that? The major good things about XML are interoperability and that you don't need to flatten hierarchical data structures like in relational data storage. And you're proposing some sort of a flat query language on a tree-based data structure?

What good would that be? You're forcing developers who are finally starting to drop that "data flattening" attitude to walk back that abstraction and write queries in a pattern matching language, as if the input were a big string with special boundary symbols. Even if this had a positive effect on performance (which it probably wouldn't, as it's basically the same as writing XPath expressions for a streaming processor), you would get another impedance mismatch. Writing such patterns would be really difficult, comparable to using SAX directly.

XPath and XPath 2 are languages designed for accessing trees and (at least for the simple things) very intuitive for someone who has worked with filesystem paths. They are good for exactly that. I don't see the point in querying something that is semantically a tree as if it was a flat line of event symbols using a pattern matching language.


Streaming XPath with SAX

February 12, 2005 at 13:03 #

Lars writes:

But I like SAX much more than DOM. It is faster, has reduced memory-consumption and you have to think some seconds before you start to work, which leads to code that looks as if someone has thought about the problem before starting to write something.

It is obvious that a tree-based approach that is used by XPath cannot be used to wrap a SAX-like API. Perhaps STXPath, which is used for Streaming Transformations for XML (STX) might be a solution. This seems to be exactly the problem David Megginson already thinks of.

That's actually not quite true. You can process at least a subset of XPath using a SAX-like parser. See here (XSQ) or here (O'Reilly: Streaming XPath) for a product and an overview with some pointers. There's also something going on in the .NET world (e.g. here). It's generally a subset because evil stuff like the ancestor axis is really painful to do in such systems, even though it's possible with large buffers. But the subset includes the important things like the child axis and generally every forward axis, and you can even support predicates, again with buffers. I've read a quite interesting paper about this, though I lost the link to it. The general idea was to visit the elements of the XML tree (the SAX events) only once, but to visit the elements of the query tree much more often to see if they match.

Streaming XPath is IMHO the only solution for querying big XML files if you don't have an XML database. Simple queries are not even necessarily slower, at least not slower than reading a big XML file into DOM and executing queries on that. It's probably sufficient for simple applications, but for complex queries (or general queries, i.e. not a subset of XPath but all the axes and syntactic sugar) it will probably be slower.
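
To illustrate the easy part of that subset, here is a rough sketch of matching an absolute child-axis path such as /library/book/title on top of plain SAX events. The class PathMatcher and its select method are invented for this example and are not the API of XSQ, STX or any other product mentioned here. It assumes no namespaces, no predicates and no axes other than child, so the only state it needs is the stack of currently open element names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical matcher for absolute child-axis paths like /a/b/c over a SAX stream. */
final class PathMatcher extends DefaultHandler {
    private final String[] steps;                           // e.g. {"library", "book", "title"}
    private final Deque<String> open = new ArrayDeque<>();  // names of currently open elements
    private final List<String> matches = new ArrayList<>();
    private StringBuilder text;                             // non-null only inside a matching element

    PathMatcher(String path) {
        steps = path.substring(1).split("/");
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        if (pathMatches()) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text != null && pathMatches()) {
            matches.add(text.toString());
            text = null;
        }
        open.removeLast();
    }

    /** True if the open-element stack equals the path, step by step from the root. */
    private boolean pathMatches() {
        if (open.size() != steps.length) {
            return false;
        }
        int i = 0;
        for (String name : open) {
            if (!name.equals(steps[i++])) {
                return false;
            }
        }
        return true;
    }

    /** Parses the XML once and returns the text content of every element on the path. */
    static List<String> select(String path, String xml) {
        PathMatcher handler = new PathMatcher(path);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.matches;
    }
}
```

Memory use is bounded by the nesting depth plus the text of the element currently being matched. Predicates and backward axes are exactly where this simple scheme stops working and the buffers come in.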

PS: Reading more about STX, it seems like an effort to standardize streaming XPath processors. STXPath in particular seems interesting. The specification defines some sort of minimum context for processors to provide. Programmatically, this example of a streaming XPath .NET API seems interesting. It shows how you can filter relevant nodes out of a stream and assign a handler to them. This seems to remove the need to implement awkward DOM navigation or SAX handlers (mind you, context-aware ones!) but leaves the interpretation/handling of the XML to your own C# code. This should, for example, enable users to easily parse XML and create their internal data representation from it.


XML encoding problems

February 9, 2005 at 11:21 #

Attractive Nuisance contains a link to an interesting slideshow about XML encoding problems, ranging from charsets to attribute order, whitespace, entities, double escaping etc.

The slide titled "QNames" just reads "don't even get me started". If everyone understands that XML namespaces are broken, why doesn't anyone really do something about it? The W3C seems to have failed on this issue, with ambiguous URIs, hard-to-decipher QNames and a really strange way of scoping namespaces.


Why do XML APIs suck?

February 9, 2005 at 00:32 #

Complexity seems to be the general complaint about XML parsing APIs. So why do these APIs suck?

I've worked with the three major API styles myself (DOM, SAX, XML Push thingies) and yes, they do suck. But if you come to think about why they suck and how to make them better, you'll find out it might be about your programming language. I've used these APIs with Java and C++ (or C) and it was unbelievably complex and hackish to navigate and recognize the XML structure. Even navigating to some element on the child axis takes a lot more code than it should. Creating XML is just a nightmare; just writing out angle brackets to a StringBuffer or the equivalent is way easier than using any API. But after all, how would you formulate that in Java (without just using an XPath library) in a better way?
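
To make the complaint concrete, here is a small sketch that fetches the first title of the first book from a made-up document, once by walking the DOM child axis by hand and once with the XPath API from javax.xml.xpath. The class ChildAxisDemo, its helpers and the element names are invented for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical comparison of hand-written DOM navigation with an XPath expression. */
final class ChildAxisDemo {
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    /** Returns the first child element with the given name, skipping text and comments. */
    static Element firstChild(Node parent, String name) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.ELEMENT_NODE && name.equals(n.getNodeName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** Two child steps by hand, with a null check after each one. */
    static String titleViaDom(Document doc) {
        Element book = firstChild(doc.getDocumentElement(), "book");
        Element title = book == null ? null : firstChild(book, "title");
        return title == null ? null : title.getTextContent();
    }

    /** The same navigation as a single path expression. */
    static String titleViaXPath(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath()
                    .evaluate("/library/book[1]/title[1]", doc);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("invalid expression", e);
        }
    }
}
```

The hand-written version needs a helper, a loop and a null check per step, while the path expression says the same thing in one line. Note that the two differ when nothing matches: the DOM version returns null here, the XPath version an empty string.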

I think the problem is using an imperative language with good support for single-level structured data of statically known types to query/modify/whatever a data type which is strongly oriented towards hierarchical, dynamic, ordered structures. To really manage this you would need a language that provides built-in support for lists, hierarchical navigation and a good approach to dynamic typing. It would also need to be extensible to really marry the XML support with the language. So you could either go and create/use something like EAX or Cω (C-omega) or start with XQuery. XQuery sounds like a better candidate, as stable engines with good typing support seem to be a lot less science fiction than the other languages.

Implementing something XML-ish starting off with writing a SAX consumer is IMHO just the wrong approach. It seems to be like implementing a GUI application starting off with raw drawing primitives and a user event queue. Those things have to be done, but ideally they should be done only once. In a slightly less ideal world it will be several times, but at least not once by every application programmer.

Oh and yes, there will be a performance drawback to using an interpreted high-level language like XQuery. But if you really need that performance in the area where your application is dealing with XML, you might either be one of the guys who really has to use DOM and SAX or you're doing really strange stuff ;-)


Forced to prostitution?

February 4, 2005 at 01:58 #

While this story from a British tabloid has already been debunked as fake, there might be something to add.

The actual reason behind this is that the government considered it impossible to abolish prostitution (which, arguably, has been proven by history several times). So to improve living conditions for prostitutes they legalized their job. This enables prostitutes to sue clients for pay, get into various welfare systems etc.

While it might actually be possible under current unemployment law to be urged to take such a job (as the taz points out), it's of course against the constitution and common sense. And, as strange as it may sound, even in Germany not every law is followed to the letter ;-)

(via pubcrawler)


BlogBridge

February 4, 2005 at 00:30 #

BlogBridge is a Java-based, WebStart-enabled blog aggregator. The beautiful thing about it: you can give rankings to different feeds, which define the sort order (feeds with higher rankings are shown at the top). These rankings seem to be some kind of distributed user rating, as it shows defaults that it probably fetches from the internet. No privacy statement about that btw. It also has a user interface that shows articles as they appear on a webpage as opposed to in an eMail client, which is definitely positive.

The con is that it currently does not render fonts anti-aliased and has no preferences whatsoever for font display. As it's Java it doesn't integrate with my Gnome settings either, so font rendering sucks atm, and this is crucial for a feedreader. Also, setting the browser in the preferences doesn't work (at least not on Linux/Gnome/with Opera), so I can't open links in feeds.

Conclusion: promising UI ideas but not ready atm.


The Mac Mini sucks

February 3, 2005 at 23:49 #

For Lars: Mac Mini: The Emperors New Computer.

via Joi Ito


Trackback spam

February 1, 2005 at 10:52 #

Early this morning I got hit with a run of trackback spam: one trackback for every article I wrote. The spammers seem to have discovered WordPress blogs, as many writers are complaining in the WP forums. Luckily the spam comments in my blog were very uniform, so I was able to remove all of them with a single SQL statement ( DELETE FROM wp_comments WHERE comment_author_url = '...' ). These spammers are really getting annoying.

Everyone is talking about identification needs on the web, and this sounds somewhat linked to getting rid of spammers. To defeat spammers we would need a technology that is able to cleanly identify whether something approaching your website is human or not. Turing tests with images containing numbers don't work, as they break trackbacks completely.

The next problem is that the spammer might just use his real identification and change it afterwards. So the identification technology would need to map IDs to single persons in an irrevocable manner, with no possibility for a spammer to register 5000 IDs. While this would fix spamming and maybe several online fraud related problems, it obviously violates privacy, so no one would really use it.

So what to do? Simple filters won't help for long, disabling trackbacks would suck, and Turing tests don't work here either. Some kind of adaptive Bayesian filtering would be nice, but this would mean a lot of work on weblogs. I think I'll start looking for the simple filters ...

Kitten's Spaminator seems to be a good start for filter plugins.
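
For reference, the adaptive Bayesian filtering mentioned above could look roughly like the following naive Bayes sketch. This is not how Spaminator or any WordPress plugin works; the class TrackbackFilter, its tokenizer and the add-one smoothing are assumptions made for this illustration. "Adaptive" here simply means that every trackback you mark as spam or not spam is fed back in through train.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical naive Bayes classifier over the words of a trackback or comment. */
final class TrackbackFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    /** Learns from one message that a human has labelled as spam or not spam. */
    void train(String text, boolean spam) {
        Map<String, Integer> counts = spam ? spamCounts : hamCounts;
        for (String token : tokens(text)) {
            if (token.isEmpty()) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
            if (spam) spamTokens++; else hamTokens++;
        }
        if (spam) spamDocs++; else hamDocs++;
    }

    /** Compares log-probabilities of both classes, using add-one smoothing throughout. */
    boolean isSpam(String text) {
        Set<String> vocabulary = new HashSet<>(spamCounts.keySet());
        vocabulary.addAll(hamCounts.keySet());
        double v = vocabulary.size() + 1.0; // +1 leaves room for unseen words

        double spamScore = Math.log((spamDocs + 1.0) / (spamDocs + hamDocs + 2.0));
        double hamScore = Math.log((hamDocs + 1.0) / (spamDocs + hamDocs + 2.0));
        for (String token : tokens(text)) {
            if (token.isEmpty()) {
                continue;
            }
            spamScore += Math.log((spamCounts.getOrDefault(token, 0) + 1.0) / (spamTokens + v));
            hamScore += Math.log((hamCounts.getOrDefault(token, 0) + 1.0) / (hamTokens + v));
        }
        return spamScore > hamScore;
    }

    /** Crude tokenizer: lower-case, split on anything that is not a letter or digit. */
    private static String[] tokens(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+");
    }
}
```

The classifier itself is small. The real work is the part already mentioned: somebody has to keep labelling messages so that the word counts follow whatever the spammers try next.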


Squashed Philosophers

January 24, 2005 at 20:06 #

Squashed Philosophers via Joho the Blog. There are some guys missing, but anyway it's interesting.