Martin Probst's weblog

RFC (2)822 dates in IMAP and Courier

Oct. 10, 2007, 7:44 a.m. — 0 comments

I'm writing a little Ruby script to download emails from my IMAP server and put them in a Maildir structure. It's more of a learning exercise: I'm aware that there are working tools for this task, but they all seem a bit complicated to use.

Something strange I noticed is that Courier-IMAP seems to return INTERNALDATE in non-RFC822 (or RFC2822) compliant format:

irb(main):008:0> imap.uid_fetch(16860, 'INTERNALDATE')[0].attr['INTERNALDATE']
=> "01-Jun-2007 09:04:04 +0200"

That should have been "01 Jun 2007 09:04:04 +0200", with an optional "Fri, " in front (no dashes!).

This is probably a problem with the standard. While RFC 3501 (IMAP) does not say anything specific about the correct date format to use, it seems to implicitly reference RFC 2822 for that. It also contains examples in RFC2822 format. Another hint that writing good standards and specs is really hard.

Interesting thing: this does not seem to be a problem in practice. Apart from my little script, mail clients don't seem to be bothered by it and probably do some fuzzy parsing.
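
For a script like mine, one way to cope is to parse the dashed form explicitly and re-emit it as an RFC 2822 date. This is only a sketch: the helper name internaldate_to_rfc2822 is made up for illustration, and it assumes the server always answers in the exact shape shown in the irb session above.

```ruby
require 'time'

# Format of the INTERNALDATE string seen in the irb session above
# (assumption: the server always uses this exact shape).
INTERNALDATE_FORMAT = '%d-%b-%Y %H:%M:%S %z'

# Hypothetical helper: parse the dashed date and render it RFC 2822 style.
def internaldate_to_rfc2822(internaldate)
  Time.strptime(internaldate, INTERNALDATE_FORMAT).rfc2822
end

puts internaldate_to_rfc2822('01-Jun-2007 09:04:04 +0200')
# prints "Fri, 01 Jun 2007 09:04:04 +0200"
```

Time.strptime raises an ArgumentError when the string does not match, so a server that answers in a different shape would be noticed rather than silently misparsed.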

Wide Finder in Scala

Sept. 24, 2007, 10:42 p.m. — 4 comments

Tim Bray:

In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It's a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.

So while it's probably most sensible to do this with some map/reduce library, I tried implementing it using Scala actors. I'm not a Scala programmer, and have no clue about the Actors library, so this code is probably totally wrong, inefficient etc. But at least I can learn something this way :-)

First the original Ruby script:

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

Converted to Scala, that gives for the serial case:

import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.collection.mutable.HashMap

object SerialAnalyzer extends Application {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))

  // Maps each matched URI to the number of times it was fetched.
  val counts = new HashMap[String, Int]
  var line = reader.readLine
  while (line != null) {
    val matcher = pattern.matcher(line)
    if (matcher.find()) {
      val uri = matcher.group(1)
      val count = counts.getOrElse(uri, 0)
      counts(uri) = count + 1
    }
    line = reader.readLine
  }
}

This takes about 1.5 seconds to go through 250 MB of log files on a dual-core 2 GHz MacBook Pro.

And the version using actors:

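import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.actors.Actor
import scala.actors.Actor._
import scala.collection.mutable.HashMap

object Analyzer {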
  def main(args: Array[String]): Unit = {
    val numAnalyzers = if (args.length > 0) Integer.parseInt(args(0)) else 4
    val logreader = new LogReader(numAnalyzers)
    logreader.start
  }
}

class LogReader(numAnalyzers: Int) extends Actor {
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))
  // Despite the name, this reads a chunk of 10001 lines (null entries past the end of the file).
  def hundredLines = (for (i <- 0 to 10000) yield reader.readLine).toList
  
  val analyzers = (for (i <- 1 to numAnalyzers) yield new LogMatcher).toList
  analyzers.foreach(_.start)
  
  def act = {
    while (reader.ready) analyzers.foreach(_ ! hundredLines)
    analyzers.foreach(_ ! Stop)
    for (analyzer <- analyzers) {
      receive {
        case result: HashMap[String, Int] => print("Done.\n")
      }
    }
    val resultMap = new HashMap[String,Int]
    for (map <- analyzers.map(_.counts); (uri, count) <- map) {
      resultMap(uri) = resultMap.getOrElse(uri, 0) + count
    }
    for (entry <- resultMap) print(entry._1 + ": " + entry._2 + "\n")
  }
}

// Message telling a LogMatcher to report its counts and exit.
case object Stop

object LogMatcher {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
}
class LogMatcher extends Actor {
  val counts = new HashMap[String, Int]
  
  def act = {
    loop {
      react {
        case lines: List[String] =>
          for (line <- lines if line != null) { 
            val matcher = LogMatcher.pattern.matcher(line)
            if (matcher.find()) {
              val uri = matcher.group(1)
              val count = counts.getOrElse(uri, 0) 
              counts(uri) = count + 1
            }
          }
        case Stop => 
          sender ! counts
          exit()
      }
    }
  }
}

The code does work, but sadly the Actors version is not faster than the single-threaded version on my dual-core MacBook Pro. No idea why. The program also exhibits some sort of memory leak: it seems to keep the whole file in memory, giving OutOfMemoryErrors if you don't run it with a Java heap big enough for the whole log file. Again, no idea why; I don't seem to keep any nasty pointers to anything.

So what does this give? Ruby is an elegant language with a nice collections API. Scala is much nicer than Java, but still quite talkative. And I obviously didn't really get something about the Scala actors...

PS: The Ruby version takes about 20 seconds to go through 270 MB of logs. The serial, no concurrency Scala version takes 18.5 seconds. Simply reading the data line-by-line using Scala takes over 12 seconds.
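
The structure of the actor version (hand out chunks of lines, count per worker, merge the per-worker maps at the end) can be sketched in Ruby as well. This illustrates the shape only and is not a faster Wide Finder: the names count_chunk, merge_counts and wide_count are invented here, it uses the /root/ pattern from the Scala code, and plain Ruby threads on the standard interpreter will typically not spread the matching over several cores.

```ruby
# Same pattern as the Scala versions above.
PATTERN = %r{GET /root/(\d\d\d\d/\d\d/\d\d/[^ .]+)}

# "Map" step: count matching URIs in one chunk of lines.
def count_chunk(lines)
  counts = Hash.new(0)
  lines.each do |line|
    match = PATTERN.match(line)
    counts[match[1]] += 1 if match
  end
  counts
end

# "Reduce" step: add up the per-chunk hashes.
def merge_counts(partials)
  partials.each_with_object(Hash.new(0)) do |partial, total|
    partial.each { |uri, count| total[uri] += count }
  end
end

# Hypothetical driver: one thread per chunk of chunk_size lines.
def wide_count(lines, chunk_size = 10_000)
  workers = lines.each_slice(chunk_size).map do |chunk|
    Thread.new { count_chunk(chunk) }
  end
  merge_counts(workers.map(&:value))
end

sample = [
  'GET /root/2007/09/24/Wide-Finder HTTP/1.1',
  'GET /root/2007/09/24/Wide-Finder HTTP/1.1',
  'GET /root/2007/10/10/Dates HTTP/1.1',
  'GET /favicon.ico HTTP/1.1'
]
p wide_count(sample, 2)
```

Each worker only touches its own hash, and the merge happens after all threads have finished, so no locking is needed in this sketch.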

Type inference for Java

April 17, 2007, 10:23 p.m. — 0 comments

InfoQ has an article on Type inference for Java.

Commenter Steve Jones states the following:

Type inference is just a case of complete laziness and is brought to us by the same sort of people who think that typing is the most time consuming part of the exercise. Are there people who really think that the problem with Java is that there are too many characters in a .java file? It would be great to see some efforts focused around making Java a better language for support and professional development. Things like contracts on methods and classes (ala Eiffel) would be nice. Saving 5 characters just because you think that will be quicker? Complete and utter muppetry.

Actually, yes I do. One problem with Java is indeed the amount of code that needs to be written. And generating code using some IDE is not a solution - code gets read a lot more often than written, so helping with writing doesn't help at all. Code needs to be easier to understand, not easier to write.

Java files tend to get enormous in size, even for really mundane tasks. This is partially due to bad APIs (I blogged about my experience of putting an XSLT transformation in a self-contained JAR). The other part is just the language itself. I'd really hope that good type inference could get rid of a lot of the hard-to-read code. It will require quite a change in how APIs are written, i.e. being more explicit about the return type of things in the method name.

But still, reducing the SLOCs must be the major goal of any language. Reduce complexity, and the best measure for complexity is still to this day the number of lines of code. Given any certain task, the solution that requires less code is almost always easier to understand.

The holy grail CSS

April 4, 2007, 8:55 a.m. — 0 comments

Elliotte Rusty Harold writes about a slightly modified version of the "Holy Grail" CSS layout on his weblog.

Something that really annoys me with this sort of CSS layout is the need to specify actual widths; be it in percentage or in ems or whatever. What I really want is a CSS layout that has a left and/or right column, and the columns extend to the size they actually need, and the center div shrinks accordingly, maybe in some min/max boundary. This very useful behaviour that HTML tables had seems to be simply impossible with current CSS standards.

It's surprising how much unintuitive, hackish CSS is needed for such stuff, even without the IE hacks, just regular CSS. If one needs to go to these lengths just to get some really basic stuff everyone needs, maybe the spec is simply not that good? I always struggle with CSS, and the number of web pages with CSS layout templates suggests other people have problems, too.

Packages, Classes, Methods - Scopes?

April 1, 2007, 4:57 p.m. — 1 comment

I wonder why there is a distinction between the concepts of packages, classes, and methods. If you think about it a bit, it's all just scopes. You could simply drop the distinctions.

A scope is a basic building block in a programming language that has

Scopes can be instantiated - in the case of packages, it's loading the package. In the case of classes, it's actually instantiating the class. In the case of methods, it's calling the method. Instantiating a scope will initialize the non-static fields to something and yield a scope instance - the activated package, an object instance or the function's stack frame/closure. The scope instance can then be passed around and members of it can be called.

To make it look nicer, you would probably end up providing some syntactic sugar like how function calls return their 'return value' member by default or something. But there doesn't really seem to be a big conceptual difference here. Packages are currently not used or implemented in this instantiation way (or are they? OSGi?), but I think this might be really beneficial...
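
Ruby happens to expose something close to this idea, which makes for a small experiment. In the sketch below, calling a method and instantiating a class both yield a value that can be passed around and asked for its members. The names Counter, counter_scope and member are invented for illustration, and Ruby's Binding is only an approximation of the "scope instance" described above.

```ruby
# A class: instantiating it yields an object whose members are instance variables.
class Counter
  def initialize(start)
    @count = start
  end
end

# A method: calling it yields its local scope, captured as a Binding.
def counter_scope(start)
  count = start
  binding
end

# Hypothetical uniform accessor: read a member from either kind of scope instance.
def member(scope, name)
  if scope.is_a?(Binding)
    scope.local_variable_get(name)
  else
    scope.instance_variable_get("@#{name}")
  end
end

object_instance = Counter.new(1)
method_instance = counter_scope(2)

puts member(object_instance, :count) # prints 1
puts member(method_instance, :count) # prints 2
```

That the accessor still has to distinguish the two cases is exactly the kind of distinction the idea would do away with.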

The idea probably doesn't hold up, but whatever, it sounded nice at first :-)

DocBook to Word (with free bibliography converter!)

Feb. 16, 2007, 2:02 p.m. — 1 comment

My girlfriend is writing her PhD thesis at the moment. This raises the big question of what format to use for the document. The options are (as far as I know) LaTeX, DocBook, OpenOffice.org and Word.

LaTeX

I've tried LaTeX on her in the past, and she was not too happy. First I tried LyX, which simply did not work for the task we had. It's actually really confusing, unstable, ugly etc. Not a choice. Editing LaTeX source by hand is nothing she wants to do. Also, she needs to be able to manage the document structure etc. by herself, so this simply doesn't work. Pro for LaTeX: really nice documents as a result, citation support is OK, though customising it is a pain.

DocBook

So I went on and looked for a more modern publishing format with decent editors. As Lars Trieloff is a friend of mine I know DocBook and have used it several times in the past. There are editors for it that do not totally suck and it's fairly easy to customise using XSLT. So we tried that for several months.

Net result: citation support totally sucks in DocBook. There are bibliography elements available, but no-one really seems to know which elements to use for what (e.g. how to properly document an article, inproceedings, etc.). This was a major obstacle, and tool support for it is also really bad in the editors we tried (XXE and Serna). XXE is a nice editor, but we settled on Serna because its output is closer to the real end result and my girlfriend somehow preferred it.

Serna itself is a good XML editor, and they have a very strong technology behind it. The problem is that it doesn't really support DocBook out of the box very well. They have some stylesheets for a very old DocBook version shipping with it, but apart from that there isn't much.

Another major problem along with citations: tables. There are the CALS tables, which are ridiculously complex to edit, and there are HTML tables, which are complicated to edit. Tool support in Serna is really bad for tables, I'm sorry to say.

So basically, after a long time trying, we gave up on that.

OpenOffice.org

We didn't try that. I've used OpenOffice in the past and I remember it as a horribly slow and ugly copy of MS Word, with a totally weird user interface. I've tried to get citation support to work in OpenOffice but didn't manage to. Also, I've had issues with OpenOffice and corrupted files... so if I want an unstable office suite, I can also use MS Office which is at least a bit easier to use.

Microsoft Word 2007

So we ended up here again. Which is really, really sad. Word 2007 sure is better than previous versions; the new user interface in particular is very good. Also, they've added a lot of tools for academia, and the citation support especially looks really good. But there are still the same stability problems. I've just tried saving a large document, and Word simply crashed. Plus, it somehow killed the backup copy - after restarting, 30 minutes of work were gone. I'm quite scared at the prospect of creating a 300+ page document with graphics, footnotes, citations etc. in this tool.

But there is no apparent alternative. OpenOffice is no better, plus it sucks more, and the other tools are simply not usable for someone without a strong technical or computer science background. This means frequent backups and keeping our fingers crossed for the next two years...

DocBook exodus and the bibliography

So how did I get the existing document from DocBook to Word? I couldn't get the available transformation stylesheets to run, so in the end I more or less simply imported the full text from the XML document.

I've written a small stylesheet to convert the docbook bibliography to Word 2007 format, available here. It's probably incomplete, erroneous etc. and will kill all your data, but hey, it's at least something ;-)

You can simply run your existing bibliography through it and then select that file as the bibliography source in Word, or alternatively merge it with the existing file using your favorite VIM, uh, text editor I mean ;-)

House across the street

Feb. 12, 2007, 3:10 p.m. — 1 comment

A detail of the old house across the street. It's one of the last houses in Potsdam Babelsberg that has not been renovated since 1989:

[Image: Detail old house across the street]

Markup's raison d'être

Jan. 29, 2007, 9:30 a.m. — 0 comments

David Megginson posted a follow-up on his earlier take on JSON, which contains an excellent example/explanation of the advantages of markup compared to hard-wired data structures.

He starts off with a simple names example and ends up with the following JSON structure, in a futile attempt to design the be-all and end-all of names data structures:

>    {"type": "name", "gender": "male", "prefix": "Don",
>      "given-name": "Alonso", "surname": "Quixote",
>      "surname-after-given-names": true, "postfix": "de la Mancha"}

Where the minimal needed markup would still be:

>    <name gender="male">Don Alonso <surname>Quixote</surname> de la Mancha</name>

Granted: the JSON code contains a lot more semantic information about its content, and that's why it's so much more complicated. But that is exactly the point: with XML (or markup in general) you can leave your data structures underspecified like Don Quixote's name, still have meaningful semantic annotations in parts of them, and then later go on and add other information to the data. Properly written clients (i.e. ones that don't depend on the exact number of XML elements below name) will continue to work without noticing.

Data structures such as the JSON code above require you to actually specify all this stuff upfront, and adding to them later will break any existing clients.
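
To make the "properly written client" point concrete, here is a small Ruby sketch using REXML, which ships with most Ruby installations. The helper surname_of and the prefix and given-name elements in the second document are assumptions made up for this example; they are not taken from David's post.

```ruby
require 'rexml/document'

# Hypothetical client: finds the surname wherever it sits in the document,
# without depending on how many other elements surround it.
def surname_of(xml)
  doc = REXML::Document.new(xml)
  surname = REXML::XPath.first(doc, '//surname')
  surname && surname.text
end

# The underspecified markup from above.
original = '<name gender="male">Don Alonso <surname>Quixote</surname> de la Mancha</name>'

# The same name after someone added more annotations later
# (element names invented for illustration).
extended = '<name gender="male"><prefix>Don</prefix> <given-name>Alonso</given-name> ' \
           '<surname>Quixote</surname> de la Mancha</name>'

puts surname_of(original) # prints "Quixote"
puts surname_of(extended) # prints "Quixote"
```

The client asks only for the part it understands, so the extra annotations in the second document pass by unnoticed.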

Great post, go read it.

PS: Sean McGrath elaborates.

Tim Bray on links

Jan. 21, 2007, 10:57 a.m. — 0 comments

Tim Bray on links:

Solution: Link Bundles? – If we really care about links being useful in the long term (and we should), maybe we need to abandon the notion that a single pointer is the right way to make one that matters. If I want to link to Accenture or Bob Dylan or Chartres Cathedral, I can think of three plausible ways: via the "official" sites, the Wikipedia entries, and Google searches for the names. [More generally, I should say: direct links, online reference-resource links, and search-based links. I'll come back to that.]

I wonder why someone would want a Google link? Every browser I've used recently has some sort of "highlight term and search" function, and with that function I can pick my search engine myself, plus it's more future-proof than a static document. I don't think this adds any value for the user, and it's also semantically awkward - imagine search-linking to "miserable failure"; what is the meaning of that? Not to mention links being fragile ...

But the multi-link idea is indeed needed and implemented on many webpages, though typically as a special page about, e.g., a certain company with links to all related information - web page, stock ticker, recent announcements, articles. The question is not whether people need multi-links but rather whether they need a special technical solution for that. What would be the benefit of XLink over this way of simply implementing it on top of HTML?

Extensible modules in XSLT and XQuery

Jan. 20, 2007, 1:14 p.m. — 0 comments

Michael Kay notices something on the extensibility of XSLT vs. XQuery code:

(2) It's interesting to note how hard it would be to do this in XQuery. The main XQueryX stylesheet certainly benefits immensely from XSLT's top-down apply-templates processing model, but in theory it could have been written in XQuery. The modification layer, however, that changes the behaviour of the transformation to do something slightly different, would be quite impossible to add without modifying the source code of the original query. This is an observation I've made in a number of larger applications: once you want to write code that can be reused for more than one task, XSLT has quite a few features that make it a stronger candidate for the job than XQuery.

Where "this" is importing an existing stylesheet and overriding part of its implementation by redefining a specific template. This is indeed something that is currently very difficult in XQuery. Templates are somewhat like functions in XSLT, and it's possible to selectively override/replace certain templates/functions.

Quite remarkably, this kind of extensibility wouldn't even be possible if XQuery had higher order functions - in that case the original module would have had to anticipate that someone might want to replace that certain function.
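
A loose analogy in Ruby, offered only as a sketch: method overriding plays the role of redefining a template in an importing stylesheet, while a function taking a callback plays the role of a higher order function. All names here (Report, ShoutingReport, render_report) are invented, and the analogy ignores everything else about XSLT's processing model.

```ruby
# "Original module": the author did not plan for any customisation.
class Report
  def render(title, body)
    "#{heading(title)}\n#{paragraph(body)}"
  end

  def heading(text)
    "== #{text} =="
  end

  def paragraph(text)
    text
  end
end

# "Importing stylesheet": replace one piece, leave the original source untouched.
class ShoutingReport < Report
  def heading(text)
    super(text.upcase)
  end
end

# Higher order variant: this only works because the author anticipated
# that the heading might be replaced and exposed a parameter for it.
def render_report(title, body, heading: ->(text) { "== #{text} ==" })
  "#{heading.call(title)}\n#{body}"
end

puts ShoutingReport.new.render('xslt', 'templates can be overridden')
puts render_report('xquery', 'needs a hook', heading: ->(t) { "** #{t} **" })
```

In the first case any piece can be replaced after the fact; in the second, only the piece that was turned into a parameter.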