Martin Probst's weblog

Finally a PDF reader for GNOME-Linux?

March 15, 2005, 4:43 p.m. — 0 comments

I just tried out Evince, a PDF reader. Displaying a PDF doesn't sound like a big thing, but this has actually been one of the minor annoyances on my GNOME desktop for quite some time. Xpdf, ggv, gpdf, etc. are either badly integrated with GNOME, have a really strange user interface, tend to display PDFs incorrectly or don't support basic features like searching.

Adobe Acrobat Reader does not seem to be a real alternative. I'll try out the 7.0 version soon, but 5.0 just plain sucks. Version 7 promises GNOME integration, but according to Luis Villa's review it doesn't really succeed at that.

Evince seems to fix that. I can't really say much about its ability to display PDFs correctly (I only had a few samples, all of which worked), but integration and user interface seem to be ok. It also supports searching. Evince still lacks some stuff though: multi-page scrolling, the grab-cursor mode for dragging the document view, and possibly a zoom tool that lets you specify areas to display. But this is (as far as I can see) on their TODO list and might be integrated soon.

Namespaces

March 14, 2005, 1:08 p.m. — 1 comment

Every now and then someone thinks about namespaces, and nearly everyone seems to be at least slightly confused afterwards. David Megginson tries to clarify things, although he admits the decision the Namespace group has taken is not really perfect.

Now what really bugs me is that according to his post (and Dare Obasanjo has written about that before too) there are three cases for the namespace of an attribute:

The annoying thing about that is, in my opinion, the "locally scoped" case. I might be missing something, but I don't see any other XML standard that really requires or uses that "locally scoped" feature. At least XPath, XQuery etc. don't use it, and as far as I know XML Schema doesn't either, does it?

The namespace should have been either inherited from the element the attribute is in or (in my opinion clearer) been the default namespace set with <myelem xmlns="foo"/>. There should really not be any difference between elements and attributes regarding XML namespaces as both use the same syntax (QNames). Using the inherit-mechanism not only for attributes but for elements also might bring strange effects when including XML snippets within other documents.
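The behaviour complained about here can be observed with any namespace-aware parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the attribute names `plain` and `qualified` and the second namespace `bar` are made up for illustration. ElementTree writes qualified names as `{namespace}local`.

```python
import xml.etree.ElementTree as ET

# One element in the default namespace "foo", carrying one unprefixed
# attribute and one attribute with an explicit prefix.
doc = '<myelem xmlns="foo" xmlns:p="bar" plain="1" p:qualified="2"/>'
elem = ET.fromstring(doc)

# The element picks up the default namespace ...
element_name = elem.tag

# ... but the unprefixed attribute does not; only the prefixed one
# ends up in a namespace.
attribute_names = sorted(elem.attrib)

print(element_name)      # {foo}myelem
print(attribute_names)   # ['plain', '{bar}qualified']
```

So the same unprefixed QName syntax means "default namespace" on the element and "no namespace" on the attribute.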

Another solution would be to forbid elements without explicit namespace prefixes altogether. This would bring some annoyance to users who do not use XML namespaces or just use one within one document, but it would also be absolutely clear. I would prefer to have a clear syntax instead of surprising effects when using namespaces ...

XSLT and XQuery application domains

March 11, 2005, 4:15 p.m. — 0 comments

Once again someone published an article comparing XQuery and XSLT. As others have mentioned (here or here), this article isn't really that helpful. In fact, it's actually misleading in several places. The author compares XSL 1.0 with XQuery 1.0 where XSL 2.0 would really be the one to pick. Also the author describes how to extend XSL or XQuery processors giving code samples which are tailored for two specific implementations. I'm not really sure how that is supposed to be helpful as the mechanism for extension is bound to be vendor specific and can be very different from implementation to implementation.

My main criticism of the article is that it once again mixes up the application domains of the two languages. You cannot directly compare XQuery and XSL; they have been created for two very different purposes. The only similarity is that they both work on XML. Think about it, the W3C wouldn't invent two languages for exactly the same purpose, would it?

XSL is the eXtensible Stylesheet Language. An XSLT is a Stylesheet Transformation, i.e. it's supposed to take an input document and apply some kind of a style to it by converting its contents to something different.

XQuery is the XML Query Language. It's supposed to be used for querying XML data sources. This means: take several input sources, fetch information from them (e.g. by matching certain criteria against the sources), and return that data. The XML element constructors in the language are not meant for re-styling document contents, but rather give the user a means to structure the returned content. Do not use XQuery to re-style documents: you will probably end up with lengthy, complicated queries requiring "manual recursion" (as opposed to XSLT's automatic recursion with "apply-templates"), endless typeswitches and an ugly mix of presentational and application logic. Look at the XQuery use cases. None of the queries tries to convert documents.

A typical application might use XQuery to fetch XML from an XML database or other sources (like the filesystem or web sources - whatever). The XML would just be taken from its source, maybe structured by some tags and then passed on to a presentational part of the application where it might be styled using XSLT.

The XSLT standard arguably expects a document (and usually exactly one document) to be fully available in memory (it doesn't really require that, but all scripts and implementations I've seen actually work like that). XQuery doesn't need that, it has been designed with large data stores in mind from which you might only want to extract minimal parts. XQuery has been designed to be able to query large data stores as opposed to XSLT which has been designed to format/re-style XML documents of a size that actually fits into main memory.

In a 3-tier model (introduced by SAP back in the 80's?) you would typically find XQuery statements in the data-server-tier (as stored queries) or in the application-server-tier. XSLT scripts would be found either in the application-server-tier or, since most browsers support XSLT nowadays, in the client-tier.

MarkLogic's use of XQuery as a CGI language is quite an interesting example of using XQuery in the application-tier though in the screencast presentation we can once again see people trying to transform XML documents to XHTML using XQuery. A better example might have been aggregating information from the book database (e.g. all authors and how many books they've written) and transforming that information into something displayable by the client using XSLT. Apart from that it's quite nice btw.

XML file types

March 6, 2005, 12:16 p.m. — 1 comment

Just a quick note: why does every tool that uses XML files store them in files ending in ".xml"? This is really getting annoying. If you are using XML files in various different applications, neither Windows nor Linux can provide the correct application when double-clicking them (I don't know about MacOS, they have MIME types associated with files, don't they?). You might have "mydocbook.xml", "build.xml", "project.xml" etc.

This is especially striking when working with Eclipse. Every XML related plugin seems to consider it valid to conquer the ".xml" ending. So double-clicking an XML file in Eclipse most likely opens the wrong editor/view/whatever. Application programmers should really consider doing it like StarOffice. Provide a default filename ending for your XML application and use it!

On the other hand most current filesystems provide the ability to handle meta-data like MIME types. NTFS does, ReiserFS does, Ext3 does, XFS, JFS, etc. This has been around for quite some time so someone (Gnome?) should take the first step and use it.

Trackback Spam & Spam Karma

Feb. 28, 2005, 3:53 p.m. — 0 comments

While Lars is a little bit annoyed over my recent WordPress upgrade to Version 1.5 I made a great step forward regarding spam fighting.

Spam Karma now actually works, which is really great as I used to get about 20-30 spam comments/trackbacks every week, sometimes even more. Before I tried Spam Karma I just disabled the comments form, but soon thereafter they started to do fake trackbacks. Spam Karma can handle both comments and trackbacks, and since it started working (with WordPress 1.5) I haven't received a single spam comment. I didn't even have to moderate much.

The con is that Spam Karma only works that well because the spammers are really simple at the moment. They don't even try to masquerade in any way. The good thing is: fake characters (e.g. "1" instead of lowercase "L") won't help them. They just do this to get good Google ratings, but what good is a perfect Google rating on "texa5 ho1dem"? :-D

Java Performance and Garbage Collection

Feb. 26, 2005, 1:23 p.m. — 0 comments

I've never really read a lot about Java performance tweaking which is something I'm going to change in the future. Partially because of a new job that will presumably require these skills, partially because it's just quite interesting to see what the guys at Sun and other companies did under the hood of the JVM.

There are two general areas, optimization of your code and tweaking of the JVM parameters. I found really interesting information about GC tweaking at Sun (they have a truly ugly stylesheet for that article, but Opera users can use the custom stylesheet ...).

More about memory management can be found in this article about Soft-, Weak- and PhantomReferences. I hadn't heard about these and the package java.lang.ref at all before. Looks very interesting, especially for server applications that need caches.
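To make the cache idea a bit more concrete, here is a small sketch in Python rather than Java. This is an analogy, not the java.lang.ref API: Python's `weakref` module behaves roughly like Java's WeakReference, and Python has no counterpart to SoftReference (which is only cleared under memory pressure). The `Document` class and the cache key are hypothetical.

```python
import gc
import weakref


class Document:
    """Hypothetical cached object; stands in for anything expensive to build."""

    def __init__(self, name):
        self.name = name


# The dictionary holds its values only weakly: an entry disappears once
# nothing else in the program references the value any more.
cache = weakref.WeakValueDictionary()

doc = Document("report")
cache["report"] = doc
cached_while_referenced = "report" in cache

del doc        # drop the only strong reference
gc.collect()   # request a collection explicitly instead of relying on timing
cached_after_release = "report" in cache

print(cached_while_referenced, cached_after_release)
```

The cache never keeps an object alive on its own, so it cannot be the reason a server runs out of memory. The price is that a lookup may miss and the value has to be rebuilt.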

So much for memory management. I'm still lacking information about performance programming dos and don'ts. Apart from general stuff programmers should know (virtual functions, allocation/deallocation, ...) I haven't read much about these things in Java. Can anyone recommend a good book about Java performance programming?

Taking the XML out of XML?

Feb. 13, 2005, 12:10 p.m. — 1 comment

You could use regular expressions Lars ;-)

But seriously, why would you want to do that? The major good things about XML are interoperability and that you don't need to flatten hierarchical data structures like in relational data storage. And you're proposing some sort of a flat query language on a tree-based data structure?

What good would that be? You're forcing developers who start to finally drop that "data flattening" attitude to walk back that abstraction again to write queries in a pattern matching language as if the input was a big string with special boundary symbols. Even if this had a positive effect on performance (which it probably wouldn't, it's basically the same as writing XPath expressions for a streaming processor) you would get another impedance mismatch. Writing such patterns would really be difficult, comparable to using SAX directly.

XPath and XPath 2 are languages designed for accessing trees and (at least for the simple things) very intuitive for someone who has worked with filesystem paths. They are good for exactly that. I don't see the point in querying something that is semantically a tree as if it was a flat line of event symbols using a pattern matching language.

Streaming XPath with SAX

Feb. 12, 2005, 1:03 p.m. — 1 comment

Lars writes:

But I like SAX much more than DOM. It is faster, has reduced memory-consumption and you have to think some seconds before you start to work, which leads to code that looks as if someone has thought about the problem before starting to write something.

It is obvious that a tree-based approach that is used by XPath cannot be used to wrap a SAX-like API. Perhaps STXPath, which is used for Streaming Transformations for XML (STX) might be a solution. This seems to be exactly the problem David Megginson already thinks of.

That's actually not quite true. You can process at least a subset of XPath using a SAX-like parser. See here (XSQ) or here (O'Reilly: Streaming XPath) for a product and an overview with some pointers. There's also something going on in the .NET world (e.g. here). It's generally a subset because evil stuff like the ancestor axis is really painful to do in such systems, even though it's possible with large buffers. But the subset includes the important things like the child axis and generally every forward axis, and you can even achieve predicates, again with buffers. I've read a quite interesting paper about this, though I lost the link to it. The general idea was to visit XML tree elements (SAX events) only once but query tree elements a lot more often to see if they match.
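As a rough illustration of how small such a subset can start out, the sketch below matches an absolute child-axis path against SAX events with nothing but a stack of open element names. It is not taken from XSQ or any of the linked projects; the class name `PathMatcher` and the sample document are invented, and predicates, wildcards and reverse axes are deliberately left out.

```python
import xml.sax


class PathMatcher(xml.sax.ContentHandler):
    """Collects the text of elements whose root-to-node path equals `path`.

    Only absolute child-axis paths such as "/library/book/title" are
    supported: no predicates, no wildcards, no reverse axes.
    """

    def __init__(self, path):
        super().__init__()
        self.target = path.strip("/").split("/")
        self.stack = []      # names of the currently open elements
        self.buffer = None   # text pieces of the element being matched
        self.matches = []

    def startElement(self, name, attrs):
        self.stack.append(name)
        if self.stack == self.target:
            self.buffer = []

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if self.stack == self.target and self.buffer is not None:
            self.matches.append("".join(self.buffer))
            self.buffer = None
        self.stack.pop()


SAMPLE = b"""<library>
  <book><title>First</title><author>A</author></book>
  <book><title>Second</title><author>B</author></book>
  <title>not a book title</title>
</library>"""

matcher = PathMatcher("/library/book/title")
xml.sax.parseString(SAMPLE, matcher)
print(matcher.matches)   # ['First', 'Second']
```

Every event is visited exactly once and memory use is bounded by the nesting depth plus the text of the current match. Predicates that look at later siblings or children are where the buffers mentioned above come in.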

Streaming XPath is IMHO the only solution to query big XML files if you don't have an XML database. Simple queries are not even necessarily slower, at least not slower than reading a big XML file into DOM and executing queries on that. It's probably sufficient for simple applications, but for complex queries (or general queries, i.e. not a subset of XPath but all the axes and syntactic sugar) it will probably be slower.
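For the "simple query over a big file" case, a sketch along the following lines avoids building the full tree. It uses `iterparse` from Python's standard library, which is a pull-style API and not a streaming XPath engine; the function name `count_matching`, the `book` element and its `price` attribute are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET


def count_matching(source, tag, predicate):
    """Count elements named `tag` for which `predicate(elem)` is true,
    discarding each element's content once it has been inspected."""
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            if predicate(elem):
                count += 1
            elem.clear()   # drop attributes, text and children of this element
    return count


SAMPLE = b"<books><book price='10'/><book price='25'/><book price='40'/></books>"

expensive = count_matching(
    io.BytesIO(SAMPLE),
    "book",
    lambda elem: float(elem.get("price")) > 20,
)
print(expensive)   # 2
```

Note that `clear()` empties an element but leaves the emptied shell attached to its parent, so for truly huge inputs the parent would need pruning as well.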

PS: reading more about STX, it seems like an effort to standardize streaming XPath processors. Especially STXPath seems to be interesting. The specification defines some sort of a minimum context for processors to provide. Programmatically, this example of a streaming XPath .NET API seems interesting. It shows how you can filter relevant nodes out of a stream and assign a handler to them. This seems to remove the need to implement awkward DOM navigation or SAX handlers (mind you, context aware ones!) but leaves the interpretation/handling of the XML to your own C# code. This should - for example - enable users to easily parse XML and create their internal data representation from it.

XML encoding problems

Feb. 9, 2005, 11:21 a.m. — 3 comments

Attractive Nuisance contains a link to an interesting slideshow about XML encoding problems, starting from charsets going over to attribute order, whitespaces, entities, double escaping etc.

The slide titled "QNames" just reads "don't even get me started". If everyone understands that XML namespaces are broken, why doesn't anyone really do something about it? The W3C seems to have failed on this issue, with ambiguous URIs, QNames that are difficult to make out and a really strange way of namespace scoping.

Why do XML APIs suck?

Feb. 9, 2005, 12:32 a.m. — 3 comments

The complexity of XML parsing APIs seems to be a general complaint about XML parsing APIs. So why do these APIs suck?

I've worked with the three major API styles myself (DOM, SAX, XML Push thingies) and yes, they do suck. But if you come to think about why they suck and how to make them better you'll find out it might be about your programming language. I've used these APIs with Java and C++ (or C) and it was unbelievably complex and hackish to navigate and recognize the XML structure. Even navigation to some element at the child axis takes a lot more code than it should. Creating XML is just a nightmare, just writing out pointy parentheses to a stringbuffer or the equivalent is way easier than using any API. But after all, how would you formulate that in Java (without just using an XPath library) in a better way?
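The navigation complaint is easy to reproduce. The sketch below uses Python instead of Java or C++, but `xml.dom.minidom` follows the same W3C DOM interfaces, so the shape of the code is comparable. The document and the helper `child_elements` are made up for illustration.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

DOC = "<order><customer><name>Ada</name></customer></order>"


def child_elements(node, name):
    """Child axis by hand: element children of `node` called `name`."""
    return [
        child
        for child in node.childNodes
        if child.nodeType == child.ELEMENT_NODE and child.tagName == name
    ]


# DOM style: step through each level, then dig out the text node.
dom = xml.dom.minidom.parseString(DOC)
customer = child_elements(dom.documentElement, "customer")[0]
name_node = child_elements(customer, "name")[0]
dom_name = name_node.firstChild.data

# Path style: the same navigation as a single expression.
path_name = ET.fromstring(DOC).findtext("customer/name")

print(dom_name, path_name)   # Ada Ada
```

Both variants return the same string. The difference is how much of the code is about the tree API rather than about the data.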

I think the problem is using an imperative language with good support for single-level structured data of statically known types to query/modify/whatever a data type which is strongly oriented towards hierarchical, dynamic, ordered structures. To really manage this you would need a language that provides built-in support for lists, hierarchical navigation and a good approach to dynamic typing. It would also need to be extensible to really marry the XML support with the language. So you could either go and create/use something like EAX or CΩ (C-omega) or start with XQuery. XQuery sounds like a better candidate as stable engines with good typing support seem to be a lot less science fiction than the other languages.

Implementing something XML-ish starting off with writing a SAX consumer is IMHO just the wrong approach. It seems to be like implementing a GUI application starting off with raw drawing primitives and a user event queue. Those things have to be done, but ideally they should be done only once. In a slightly less ideal world it will be several times, but at least not by every application programmer.

Oh and yes, there will be a performance drawback with using an interpreted high-level language like XQuery. But if you really need that performance in the area where your application is dealing with XML, you might either be one of the guys who really has to use DOM and SAX or you're doing really strange stuff ;-)