Martin Probst's weblog

Why do XML APIs suck?

Wednesday, February 9, 2005 12:32 AM — 3 comments

The complexity of XML parsing APIs seems to be a general complaint. So why do these APIs suck?

I've worked with the three major API styles myself (DOM, SAX, XML pull thingies) and yes, they do suck. But if you stop to think about why they suck and how to make them better, you'll find it might be about your programming language. I've used these APIs from Java and C++ (or C), and it was unbelievably complex and hackish to navigate and recognize the XML structure. Even navigating to some element on the child axis takes a lot more code than it should. Creating XML is just as much of a nightmare; simply writing pointy brackets out to a string buffer or the equivalent is way easier than using any API. But after all, how would you formulate that in Java (without just using an XPath library) in a better way?
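
To illustrate the point, here is a minimal sketch of what such a navigation looks like with the standard DOM API that ships with Java (the document content and class name are made up for illustration):

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;

    public class DomNavigation {
        public static void main(String[] args) throws Exception {
            String xml = "<order><customer><name>Alice</name></customer></order>";
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));

            // Navigating just two levels down already means looping over node lists
            // and filtering out non-element nodes by hand.
            String name = null;
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() != Node.ELEMENT_NODE
                        || !"customer".equals(child.getNodeName()))
                    continue;
                NodeList grandChildren = child.getChildNodes();
                for (int j = 0; j < grandChildren.getLength(); j++) {
                    Node grandChild = grandChildren.item(j);
                    if (grandChild.getNodeType() == Node.ELEMENT_NODE
                            && "name".equals(grandChild.getNodeName()))
                        name = grandChild.getFirstChild().getNodeValue();
                }
            }
            System.out.println(name); // prints "Alice"
        }
    }

For comparison, the XPath expression /order/customer/name selects the same element in a single line.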

I think the problem is using an imperative language - with good support for flat, statically typed data structures - to query/modify/whatever a data type which is strongly oriented towards hierarchical, dynamic, ordered structures. To really manage this you would need a language that provides built-in support for lists, hierarchical navigation and a good approach to dynamic typing. It would also need to be extensible, to really marry the XML support with the language. So you could either go and create/use something like E4X or Cω (C-omega), or start with XQuery. XQuery sounds like the better candidate, as stable engines with good typing support seem to be a lot less science fiction there than for the other languages.

Implementing something XML-ish by starting off with writing a SAX consumer is IMHO just the wrong approach. It's like implementing a GUI application by starting off with raw drawing primitives and a user event queue. Those things have to be done, but ideally they should be done only once. In a slightly less ideal world it will be several times, but at least not by every application programmer.

Oh and yes, there will be a performance drawback to using an interpreted high-level language like XQuery. But if you really need that performance in the area where your application is dealing with XML, you might either be one of the guys who really has to use DOM and SAX, or you're doing really strange stuff ;-)

Forced into prostitution?

Friday, February 4, 2005 1:58 AM — 0 comments

While this story from a British tabloid has already been debunked as fake, there might be something to add.

The actual reason behind this is that the government considered it impossible to abolish prostitution (which, arguably, history has proven several times). So to improve living conditions for prostitutes they legalized the job. This enables prostitutes to sue clients for pay, get into various welfare systems, etc.

While it might actually be possible under current unemployment law to be urged to take such a job (as the taz points out), it is of course against the constitution and common sense. And - as strange as it may sound - even in Germany not every law is followed to the very letter ;-)

(via pubcrawler)

BlogBridge

Friday, February 4, 2005 12:30 AM — 0 comments

BlogBridge is a Java-based, WebStart-enabled blog aggregator. The beautiful thing about it: you can give rankings to different feeds, which defines the sort order (feeds with higher rankings are shown at the top). These rankings seem to be a form of distributed user rating, as it shows defaults it probably fetches from the internet. No privacy statement about that, btw. It also has a user interface that shows articles as they appear on a web page, as opposed to in an email client, which is definitely positive.

The con is that it currently does not render fonts anti-aliased and has no preferences whatsoever for font display. As it's Java, it doesn't integrate with my GNOME settings either, so font rendering sucks at the moment - and that is crucial for a feed reader. Also, setting the browser in the preferences doesn't work (at least not on Linux/GNOME with Opera), so I can't open links in feeds.

Conclusion: promising UI ideas but not ready atm.

The Mac Mini sucks

Thursday, February 3, 2005 11:49 PM — 2 comments

For Lars: Mac Mini: The Emperor's New Computer.

via Joi Ito

Trackback spam

Tuesday, February 1, 2005 10:52 AM — 22 comments

Early this morning I got hit with a run of trackback spam - one trackback for every article I wrote. The spammers seem to have discovered WordPress blogs, as many writers are complaining in the WP forums. Luckily the spam comments in my blog were very uniform, so I was able to remove all of them with a single SQL statement ( DELETE FROM wp_comments WHERE comment_author_url = '...' ). These spammers are really getting annoying.

While everyone is talking about identification needs on the web, this seems somewhat linked to getting rid of spammers. To defeat spammers we would need a technology that can cleanly identify whether something approaching your website is human or not. Turing tests with images containing numbers don't work, as they break trackbacks completely.

The next problem is that a spammer might just use his real identification and change it afterwards. So the identification technology would need to map IDs to single persons in an irrevocable manner - no possibility for a spammer to register 5000 IDs. While this would fix spamming and maybe several online-fraud-related problems, it obviously violates privacy, so no one would really use it.

So what to do? Simple filters won't help for long, disabling trackbacks would suck, and Turing tests don't work here either. Some kind of adaptive Bayesian filtering would be nice, but that would mean a lot of work on the weblog side. I think I'll start looking for the simple filters ...
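
As a sketch of what such a simple filter could look like (purely hypothetical - this is not code from any existing plugin), something along these lines would flag comments whose author URL or text matches a small blacklist:

    import java.util.Arrays;
    import java.util.List;

    public class SimpleCommentFilter {
        // Hypothetical blacklist; a real filter would load this from a config file or database.
        private static final List BLACKLIST = Arrays.asList(new String[] {
                "casino", "poker", "viagra"
        });

        public static boolean isSpam(String authorUrl, String content) {
            String haystack = (authorUrl + " " + content).toLowerCase();
            for (int i = 0; i < BLACKLIST.size(); i++) {
                if (haystack.indexOf((String) BLACKLIST.get(i)) >= 0) {
                    return true; // any blacklisted word anywhere marks the comment as spam
                }
            }
            return false;
        }

        public static void main(String[] args) {
            System.out.println(isSpam("http://example.com/casino", "great post!")); // true
            System.out.println(isSpam("http://example.org/", "nice article"));      // false
        }
    }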

Kitten's Spaminator seems to be a good start for filter plugins.

Squashed Philosophers

Monday, January 24, 2005 8:06 PM — 0 comments

Squashed Philosophers via Joho the Blog. There are some guys missing, but it's interesting anyway.

Prevayler

Sunday, January 9, 2005 3:05 PM — 0 comments

I stumbled across a persistence framework for Java called "Prevayler". It's a rather interesting approach to persistence.

The basic idea is to keep all information in RAM, in ordinary Java objects. These objects are persisted using some arbitrary Java serialization technology - implementing "Serializable" will do. All operations on the objects are done through objects representing transactions. These transaction objects are passed through the system and serialized to a log. If something crashes the application, the serialized transactions are replayed from an initial state of the system. To keep the logs small, the system saves the current state to disk (using serialization) at regular intervals, like once a day. This is done by keeping a hot standby server which runs synchronized with the real server. If a backup is requested, the standby server stops syncing with the main server, dumps its objects and resyncs.
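
A minimal sketch of the idea (not the actual Prevayler API - the class and method names here are made up for illustration):

    import java.io.Serializable;
    import java.util.ArrayList;
    import java.util.List;

    // The "prevalent system": all business data lives as ordinary objects in RAM.
    class AddressBook implements Serializable {
        List names = new ArrayList();
    }

    // Every modification is wrapped in a serializable transaction object.
    class AddName implements Serializable {
        String name;
        AddName(String name) { this.name = name; }
        void executeOn(AddressBook book) { book.names.add(name); }
    }

    class Engine {
        AddressBook system = new AddressBook();
        List journal = new ArrayList(); // stands in for the serialized log on disk

        void execute(AddName transaction) {
            journal.add(transaction);      // 1. write the transaction to the log first
            transaction.executeOn(system); // 2. then apply it to the in-memory objects
        }
        // After a crash, replaying the journal against a fresh AddressBook
        // (or against the last snapshot) reconstructs the current state.
    }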

This is a very smart way of achieving quite good persistence for objects without much hassle. But while it is nice, it has a lot of limitations. The developers themselves seem to think they have found a universal solution that overcomes relational databases in general. Does it really do that?

With Prevayler, programmers have to be quite smart to really get their objects to be persistent - while they are completely unlimited regarding the use of Java features, they have to pay attention to certain things, like not keeping references to objects. Also, the speed of the system relies on the ability of the programmer to create a sensible data structure to access her data - if she chooses the wrong data structure, things will get really slow. SQL and RDBMS were invented to solve exactly these problems, and they are quite good at it, even though the conversion from the relational representation back to your business objects takes some (mostly trivial) effort.
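
To make the data-structure point concrete, here is a small hypothetical example: looking up a customer by ID in plain objects is either a linear scan or requires a hand-maintained index - something an RDBMS would give you with a simple CREATE INDEX.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class Customer {
        String id;
        Customer(String id) { this.id = id; }
    }

    class CustomerStore {
        List all = new ArrayList();  // naive: lookup means scanning every object
        Map byId = new HashMap();    // hand-maintained index: constant-time lookup

        void add(Customer c) {
            all.add(c);
            byId.put(c.id, c);       // the programmer must remember to keep this in sync
        }

        Customer find(String id) {
            return (Customer) byId.get(id); // fast only because we built the index ourselves
        }
    }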

The next thing is that Prevayler doesn't do anything in parallel. No transaction may run in parallel with another unless the user handles the synchronization himself. This is a major con. Programming concurrent applications isn't trivial, and one of the biggest benefits of an RDBMS is that it provides concurrency control, different transaction isolation levels, etc.

Prevayler also won't help you with large amounts of data. The authors claim that falling RAM prices will solve that problem, but that does not seem plausible. Real business applications can have several hundred GB of "live" data - that's not an area you will reach with cheap RAM anytime soon. With an RDBMS, such amounts of data can be managed on a (rather) cheap x86 machine with big hard drives, even if it might get slow. Prevayler simply fails here.

I think Prevayler is a smart idea that solves a limited problem - persistence of data in small applications. If they can add an intelligent means of swapping objects to disk, it might get really useful. But this would again put limitations on the programmer, like forcing him to inherit from special classes, implement certain interfaces, etc.

The end of the story is that database management systems, relational or not, provide a lot of features Prevayler doesn't give you. Prevayler is suited for applications that are written by really good programmers, won't produce too much data and don't require any concurrency.

I can't really see how the Prevayler developers come to the conclusion that they have made DBMS obsolete - and why do they think that thousands of capable programmers and scientists have simply overlooked the Prevayler approach? That seems quite arrogant to me.

Joel on the HPI

Sunday, January 9, 2005 11:41 AM — 2 comments

Joel on Software writes this:

The moral of the story is that computer science is not the same as software development. If you're really really lucky, your school might have a decent software development curriculum, although they might not, because elite schools think that teaching practical skills is better left to the technical-vocational institutes and the prison rehabilitation programs. You can learn mere programming anywhere. We are Yale University, and we Mold Future World Leaders. You think your $160,000 tuition entitles you to learn about while loops? What do you think this is, some fly-by-night Java seminar at the Airport Marriott? Pshaw.

The trouble is, we don't really have professional schools in software development, so if you want to be a programmer, you probably majored in Computer Science. Which is a fine subject to major in, but it's a different subject than software development.

Which basically sounds like the very idea behind my university, the Hasso-Plattner-Institut for Software Systems Engineering. I wonder how long it will take until this insight leads to a large-scale change in our CS education systems. It's not really new, and everyone who has finished a CS degree and started working knows it - when will the universities start to do something about it?

Project management under GNOME

Saturday, December 18, 2004 12:09 AM — 0 comments

I just gave Imendio Planner a try. It's a simple (compared to MS Project) yet very useful project planning application. With Planner you can easily define resources and tasks and combine them in a Gantt chart. The most important features are available, such as the four types of start/finish dependencies within the Gantt chart, assigning resources to tasks, sub-tasks, and more.

The tool has a nice, simple GNOME GUI and seems to be a lot easier to use than MS Project. It lacks the (IMHO) important feature of displaying your resource usage, which is a pity. While it exports to HTML and prints nicely, an option to export the Gantt chart to some graphics format would be helpful too.

Anyway, I would use this in upcoming projects as it's very simple to use, free, and fits into my usual development environment.

C++ builds the easy way with scons

Monday, December 6, 2004 7:50 PM — 0 comments

One of the minor but still annoying pitfalls of development with C++ is Makefiles. The syntax is rather cryptic, large Makefiles have to be maintained as dependencies grow, and more complex tasks require really dirty hacks.

There are a lot of make replacements out there. Today I took a look at SCons, which looks really nice. It's written in Python and does not invent a new syntax, but rather uses Python as the language build files are written in. Build files are declarative: you call functions SCons provides to tell the system which targets have to be built. After the build script has been executed, the targets are made using a set of implicit rules.

The major pros of SCons are the smart helper functions. You don't have to define dependencies between source files - SCons takes care of that by scanning the files itself (already supporting quite a nice set of languages). Implicit rules are available for compiling executables, libraries (shared and static) and some other files. The developers claim it should be easily extensible (maybe I'll try with ANTLR when I get some spare time). SCons doesn't just look at file modification times but uses MD5 hashes by default, which avoids the whole mess applications create when they touch files accidentally. SCons also keeps track of the state of intermediate files - a change in a source file that doesn't lead to a change in the object file won't lead to re-linking libraries or executables. Because SCons does not recurse into nested directories (it rather "includes" sub-build files), it should also be quite good with multiple build jobs and/or distributed compiling - recursive makefiles are a major obstacle here, as each make invocation only sees a few source files at a time.

The biggest pro is probably also a con - using Python as the Makefile language. This enables users to easily manage complex build problems using a real programming language. On the other hand it enables people to create really cryptic build files as the syntax does not have any concept of order, grouping etc. It should be possible to overcome this by employing templates, coding standards etc. but it adds another thing to control and manage.

Another con is that a POSIX-compliant make should be available nearly everywhere, while SCons would be another dependency. However, if you distribute binary packages anyway, this shouldn't be that important.

The pros seem to outweigh the cons, at least for me. I think I'll use it in future smaller C/C++ projects - if it's evil despite the good impression, I guess I'll find out all too soon ...