Martin Probst's weblog

RubyGems upgrade 0.9.5

November 21, 2007 at 19:56

RubyGems 0.9.5 is out, and installing it with gem update --system broke just about everything on my system :-(

RubyGems could no longer find any of the gems that were already installed. The fix was easy, though a bit time-consuming: re-install all the gems you have.

$ installed_gems="$(ls /Library/Ruby/Gems/gems/ | sed 's/-[^-]*$//' | sort -u)"
$ gem install $installed_gems

(with /Library/Ruby/Gems/gems/ being the path to your gems)
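
If the sed expression is hard to read, the same name extraction can be sketched in Ruby. This is only an illustration: gem_names is a made-up helper, and it assumes every directory is called name-version with no platform suffix after the version.

```ruby
# Hypothetical helper: turn gem directory names ("rails-1.2.5") into unique gem names.
# Assumption: the version is everything after the last dash, as in the sed expression.
def gem_names(dir_entries)
  dir_entries.map { |entry| entry.sub(/-[^-]*\z/, '') }.uniq.sort
end

# Example directory listing (made up for illustration).
entries = ['rails-1.2.5', 'rails-1.2.3', 'ruby-debug-0.9.3', 'rake-0.7.3']
puts gem_names(entries).join(' ')
```

On a real system the list would come from the gems directory itself, e.g. Dir.children('/Library/Ruby/Gems/gems') on a reasonably recent Ruby.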


InputManagers in Leopard

November 20, 2007 at 21:12

In Leopard, InputManagers need to be installed in /Library and owned by root, for security reasons. Tutorials on how to re-enable them can be found e.g. on Mac OS X Hints or in the TextMate blog.

Something not mentioned in those tips is that not only the input managers themselves but also the InputManagers directory must be owned by root and writable only by the owner (go-w).

These two commands did the trick for me:

$ sudo chown -R root:wheel /Library/InputManagers
$ sudo chmod -R go-w /Library/InputManagers
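
To check whether a directory already meets both conditions, something like the following Ruby sketch could be used. The helper names are made up, and it assumes that "owned by root" simply means uid 0; it only looks at a single path and does not recurse.

```ruby
# Sketch: check the two conditions described above for a single path.
# Both helper names are hypothetical.

# True if the group or others have write permission (the bits removed by go-w).
def group_or_other_writable?(mode)
  (mode & 0o022) != 0
end

# Assumption: "owned by root" means uid 0.
def input_manager_safe?(path)
  stat = File.stat(path)
  stat.uid.zero? && !group_or_other_writable?(stat.mode)
end

puts group_or_other_writable?(0o755)  # mode of drwxr-xr-x
puts group_or_other_writable?(0o775)  # mode of drwxrwxr-x
```

Calling input_manager_safe?('/Library/InputManagers') would then tell you whether the two commands above are still needed for the top-level directory.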

By the way, does anyone know what the '@' sign after the permissions in a directory listing means? As in drwxr-xr-x@ 11 root wheel 374B 11 Sep 04:25 SafariBlock?


MySQL backup/restore task for Capistrano

November 20, 2007 at 10:59

This is a simple backup task for Capistrano (which could really use a lot more documentation...).

The tool reads database configuration from the local database.yml. This Works For Me (tm) as I keep the local and remote database configuration identical - YMMV.

While the script doesn't require you to type the database user's password, it will echo it to the console for the restore task. Avoiding that seems to be quite tricky - I tried sending the backup directly over the stream and piping in the password before, but that gives an obscure error.

So the following will have to do for now, but I'm quite pleased with it. I should probably include a warning/confirmation before restoring, but hey, command lines are for experts ;-)

require 'yaml'

# Read the database settings from the local config/database.yml.
$config = YAML.load_file(File.join('config', 'database.yml'))

# %d (zero-padded day) keeps the file name free of spaces and sortable; %e pads with a space.
desc "Backup the database to db/" + Time.now.strftime("backup_#{$config['production']['database']}_%Y-%m-%d.sql")
task :backup, :roles => :db, :only => { :primary => true } do 
  backup_path = File.join('db', Time.now.strftime("backup_#{$config['production']['database']}_%Y-%m-%d.sql"))
  # The dump is written to the local machine, so clean up the local file on rollback.
  on_rollback { File.delete(backup_path) if File.exist?(backup_path) }
  backup_file = File.new(backup_path, 'w+')
  run "mysqldump --default-character-set=utf8 " +
    "--user=#{$config['production']['username']} " +
    "--password " +
    "-B #{$config['production']['database']}" do |channel,stream,data|
    if stream == :out
      # Regular output is the dump itself.
      backup_file.write(data)
    else
      # The only expected output on stderr is the password prompt.
      if data =~ /^Enter password:/
        channel.send_data($config['production']['password'])
        channel.send_data("\n")
      else
        raise Capistrano::Error, "unexpected output from mysqldump: " + data
      end
    end
  end
  backup_file.close
  logger.info "Database dumped to #{backup_path} successfully."
end

desc "Restore the database from backup"
task :restore, :roles => :db do
  backups = Dir[File.join('db', "backup_#{$config['production']['database']}_*.sql")]
  raise Capistrano::Error, "no backup found!" if backups.empty?
  # File names contain the date, so the last one in sort order is the newest.
  last_backup = backups.sort[-1]
  # Upload the dump, then feed it to mysql on the server.
  put(File.read(last_backup), "#{current_path}/db/restore.sql")
  logger.info "Restoring from #{last_backup}"
  run "mysql --default-character-set=utf8 " +
    "--user=#{$config['production']['username']} " +
    "--password=#{$config['production']['password']} " +
    "< #{current_path}/db/restore.sql" do |channel, stream, data|
    # mysql prints nothing on success, so any output is treated as an error.
    raise Capistrano::Error, "unexpected output from mysql: " + data
  end
  logger.info "Restored successfully."
end
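
The warning/confirmation mentioned above could look roughly like this. It is a sketch only: confirmed? is a made-up helper, not part of Capistrano, and the input and output streams are parameters so that the prompt can be tried out without a terminal.

```ruby
require 'stringio'

# Hypothetical helper (not a Capistrano API): ask before doing something destructive.
# Anything other than "y" or "Y", including no input at all, counts as "no".
def confirmed?(question, input = $stdin, output = $stdout)
  output.print "#{question} [y/N] "
  answer = input.gets
  !answer.nil? && answer.strip.downcase == 'y'
end

# Simulated answers instead of a real terminal:
puts confirmed?('Restore production from the latest backup?', StringIO.new("y\n"), StringIO.new)
puts confirmed?('Restore production from the latest backup?', StringIO.new("\n"), StringIO.new)
```

Inside the restore task one would presumably raise a Capistrano::Error unless confirmed?(...) returns true, before uploading anything.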

New blog engine

November 20, 2007 at 10:59

I ported my old WordPress blog over to a hand-written Ruby solution. You probably already noticed that my permalinks were not that perma, so apologies for re-appearing entries in your feed readers.

I decided to move away from WordPress after taking a look in my archives. Through various import/export operations and the liberal re-formatting of entries - done by WordPress itself or various plugins - the data in the database was a complete mess. Corrupt UTF-8, double, triple and quad escaped anything, mixed encoded and non-encoded HTML... took me quite some time to clean it up (thank God for RegExps).
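
As an illustration of the multiply-escaped entries, here is a minimal Ruby sketch that unescapes HTML entities until the text stops changing. It is not the clean-up code actually used; unescape_fully and the round limit are assumptions, and text that legitimately contains an escaped entity would be unescaped one level too far.

```ruby
require 'cgi'

# Sketch: undo repeated HTML escaping by unescaping until nothing changes.
# max_rounds is an arbitrary safety limit (an assumption, not a known requirement).
def unescape_fully(text, max_rounds = 5)
  max_rounds.times do
    unescaped = CGI.unescapeHTML(text)
    return text if unescaped == text
    text = unescaped
  end
  text
end

puts unescape_fully('Tom &amp;amp;amp; Jerry')
```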

Writing a simple blog in Ruby on Rails is an easy exercise, at first. It gets a lot more complicated once you consider trackbacks/pingbacks, proper permalinks, comment spam, etc., but more on that in separate entries.


Migrating to Google Apps (copying IMAP mails)

October 29, 2007 at 12:35

Now that Google has announced IMAP support for Gmail I’m migrating my email to Google Apps.

I’ve always had a HostEurope WebPack that provides some webspace, PHP, MySQL and IMAP. Some time ago I also ordered a virtual root server, to have some fun with rails, and a general space for experimentation. Then I wanted to take the webpack down as I didn’t need it anymore. But to be honest, I soon figured out that configuring and properly maintaining a whole email setup (MTA, IMAP, various spam filters, …) is indeed a lot of work.

So I moved all my email related stuff to Google Apps. So far it looks quite nice. It’s a bit strange that my regular Google user account didn’t integrate with the new one, but I simply dropped the old account.

Now I’m copying all my IMAP emails over to Google Mail. Surprisingly, I couldn’t find an easy-to-use, readily working script to copy IMAP messages from one host to another. There are several, but they seem to be unmaintained, to depend on obscure libraries, or to require a bizarrely complicated setup.

So in a first-class act of wheel reinvention I wrote my own IMAP copy tool in Ruby: imapcopy.rb. The only dependency is highline for the password prompt, but if you don’t want that, you can easily adapt the code.

I really like it: it does everything I needed, doesn’t require any configuration, it only copies messages that are not present on the new host, and even prints a nice spinner ;-). Sample usage:

ruby imapcopy.rb user1@somehost.com user2@gmail.com@gmail.com
Password for user1@somehost.com:
Password for user2@gmail.com@gmail.com:
...
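
The "only copy what is not there yet" step can be sketched without an IMAP connection. Note the assumptions: messages are matched by their Message-ID header here, which may not be what imapcopy.rb does, and Message and messages_to_copy are made-up names.

```ruby
require 'set'

# Hypothetical stand-in for a fetched message; a real script would read these from the server.
Message = Struct.new(:message_id, :subject)

# Assumption: two messages are "the same" if their Message-ID headers are equal.
def messages_to_copy(source_messages, destination_ids)
  known = destination_ids.to_set
  source_messages.reject { |message| known.include?(message.message_id) }
end

source = [
  Message.new('<1@somehost.com>', 'Hello'),
  Message.new('<2@somehost.com>', 'Invoice'),
  Message.new('<3@somehost.com>', 'Re: Hello')
]
already_copied = ['<2@somehost.com>']

messages_to_copy(source, already_copied).each { |message| puts message.subject }
```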

RFC (2)822 dates in IMAP and Courier

October 10, 2007 at 07:44

I'm writing a little Ruby script to download emails from my IMAP server and put them in a Maildir structure. It's more of a learning exercise - I'm aware that there are working tools for this task, but they all seem a bit complicated to use.

Something strange I noticed is that Courier-IMAP seems to return INTERNALDATE in a format that is not RFC 822 (or RFC 2822) compliant:

irb(main):008:0> imap.uid_fetch(16860, 'INTERNALDATE')[0].attr['INTERNALDATE']
=> "01-Jun-2007 09:04:04 +0200"

That should have been "01 Jun 2007 09:04:04 +0200", with an optional "Fri, " in front (no dashes!).

This is probably a problem with the standard. While RFC 3501 (IMAP) does not say anything specific about the correct date format to use, it seems to implicitly reference RFC 2822 for that. It also contains examples in RFC2822 format. Another hint that writing good standards and specs is really hard.

Interesting thing: this does not seem to be a problem in practice. Apart from my little script, no mail client seems to be bothered by it; they probably do some fuzzy parsing.
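
For a script that wants to be strict, one way around this is to parse the dashed format explicitly. A minimal sketch, using the value from the irb session above and Ruby's standard Time.strptime; whether other servers use exactly the same layout is an assumption.

```ruby
require 'time'

# Sketch: parse the dashed date string shown above and print it in RFC 2822 form.
internal_date = '01-Jun-2007 09:04:04 +0200'
parsed = Time.strptime(internal_date, '%d-%b-%Y %H:%M:%S %z')

puts parsed.rfc2822
```

This should print the date in the dash-free form expected above, including the weekday in front.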


Wide Finder in Scala

September 24, 2007 at 22:42

Tim Bray:

In my Finding Things chapter of Beautiful Code, the first complete program is a little Ruby script that reads the ongoing Apache logfile and figures out which articles have been fetched the most. It's a classic example of the culture, born in Awk, perfected in Perl, of getting useful work done by combining regular expressions and hash tables. I want to figure out how to write an equivalent program that runs fast on modern CPUs with low clock rates but many cores; this is the Wide Finder project.

So while it's probably most sensible to do this with some map/reduce library, I tried implementing it using Scala actors. I'm not a Scala programmer, and have no clue about the Actors library, so this code is probably totally wrong, inefficient etc. But at least I can learn something this way :-)

First the original Ruby script:

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

Converted to Scala, that gives for the serial case:

import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.collection.mutable.HashMap

object SerialAnalyzer extends Application {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))

  val counts = new HashMap[String, Int]
  var line = reader.readLine
  while (line != null) { 
    // Use the pattern defined above; the serial version does not need LogMatcher.
    val matcher = pattern.matcher(line)
    if (matcher.find()) {
      val uri = matcher.group(1)
      val count = counts.getOrElse(uri, 0) 
      counts(uri) = count + 1
    }
    line = reader.readLine
  }
  reader.close()
}

This takes about 1.5 seconds to go through 250 MB of log files on a dual-core 2 GHz MacBook Pro. The version using actors looks like this:

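import java.io.{BufferedReader, FileReader}
import java.util.regex.Pattern
import scala.actors.Actor
import scala.actors.Actor._
import scala.collection.mutable.HashMap

object Analyzer {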
  def main(args: Array[String]): Unit = {
    val numAnalyzers = if (args.length > 0) Integer.parseInt(args(0)) else 4
    val logreader = new LogReader(numAnalyzers)
    logreader.start
  }
}

class LogReader(numAnalyzers: Int) extends Actor {
  val reader = new BufferedReader(new FileReader("/Users/martin/tmp/log"))
  // Reads the next batch of 10000 lines; entries are null once the file is exhausted.
  def nextBatch = (for (val i <- 1 to 10000) yield reader.readLine).toList
  
  val analyzers = (for (val i <- 1 to numAnalyzers) yield new LogMatcher).toList
  analyzers.foreach(_.start)
  
  def act = {
    // Hand each analyzer a fresh batch for as long as there is input left.
    while (reader.ready) analyzers.foreach(_ ! nextBatch)
    analyzers.foreach(_ ! Stop)
    for (analyzer <- analyzers) {
      receive {
        case result: HashMap[String, Int] => print("Done.\n")
      }
    }
    val resultMap = new HashMap[String,Int]
    for (map <- analyzers.map(_.counts); (uri, count) <- map) {
      resultMap(uri) = resultMap.getOrElse(uri, 0) + count
    }
    for (entry <- resultMap) print(entry._1 + ": " + entry._2 + "\n")
  }
}

// Message telling a LogMatcher to report its counts and exit.
case object Stop

object LogMatcher {
  val pattern = Pattern.compile("GET /root/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)")
}
class LogMatcher extends Actor {
  val counts = new HashMap[String, Int]
  
  def act = {
    loop {
      react {
        case lines: List[String] =>
          for (line <- lines if line != null) { 
            val matcher = LogMatcher.pattern.matcher(line)
            if (matcher.find()) {
              val uri = matcher.group(1)
              val count = counts.getOrElse(uri, 0) 
              counts(uri) = count + 1
            }
          }
        case Stop => 
          sender ! counts
          exit()
      }
    }
  }
}

The code does work, but sadly the Actors version is not faster than the single threaded version on my dual core MacBook Pro. No idea why... also the program exhibits some sort of a memory leak - it seems to keep the whole file in memory, thus giving OutOfMemoryErrors if you don't run it with a Java heap big enough for the whole log file. Again, no idea why, I don't seem to keep any nasty pointers to anyone.

So what does this give? Ruby is an elegant language with a nice collections API. Scala is much nicer than Java, but still quite talkative. And I obviously didn't really get something about the Scala actors...

PS: The Ruby version takes about 20 seconds to go through 270 MB of logs. The serial, no concurrency Scala version takes 18.5 seconds. Simply reading the data line-by-line using Scala takes over 12 seconds.
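
For comparison, the split/count/merge structure of the actor version can be sketched in plain Ruby. This is only meant to show the shape of the computation: the helper names and the chunk size are made up, and on the standard Ruby interpreter the global interpreter lock means the threads will not actually use several cores.

```ruby
# Same pattern as in the original Ruby script above.
PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# Count matching URIs in one chunk of lines (the job of one LogMatcher).
def count_chunk(lines)
  counts = Hash.new(0)
  lines.each do |line|
    counts[$1] += 1 if line =~ PATTERN
  end
  counts
end

# Merge the per-chunk results (what the LogReader does after sending Stop).
def merge_counts(partials)
  partials.each_with_object(Hash.new(0)) do |partial, total|
    partial.each { |uri, count| total[uri] += count }
  end
end

# Hypothetical driver: one thread per chunk, then merge.
def wide_count(lines, chunk_size = 10_000)
  workers = lines.each_slice(chunk_size).map do |chunk|
    Thread.new { count_chunk(chunk) }
  end
  merge_counts(workers.map(&:value))
end

# Made-up sample lines for illustration.
sample = [
  'GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1',
  'GET /ongoing/When/200x/2007/09/20/Wide-Finder HTTP/1.1',
  'GET /ongoing/When/200x/2006/01/01/Other HTTP/1.1',
  'GET /favicon.ico HTTP/1.1'
]
wide_count(sample, 2).each { |uri, count| puts "#{count}: #{uri}" }
```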


Type inference for Java

April 17, 2007 at 22:23

InfoQ has an article on Type inference for Java.

Commenter Steve Jones states the following:

Type inference is just a case of complete laziness and is brought to us by the same sort of people who think that typing is the most time consuming part of the exercise. Are there people who really think that the problem with Java is that there are too many characters in a .java file? It would be great to see some efforts focused around making Java a better language for support and professional development. Things like contracts on methods and classes (ala Eiffel) would be nice. Saving 5 characters just because you think that will be quicker? Complete and utter muppetry.

Actually, yes I do. One problem with Java is indeed the amount of code that needs to be written. And generating code using some IDE is not a solution - code gets read a lot more often than written, so helping with writing doesn't help at all. Code needs to be easier to understand, not easier to write.

Java files tend to get enormous in size, even for really mundane tasks. This is partially due to bad APIs (I blogged about my experience of putting an XSLT transformation in a self-contained JAR). The other part is just the language itself. I'd really hope that good type inference could clean up a lot of the code that is ugly to read. It will require quite a change in how APIs are written, i.e. being more explicit about the return type of things in the method name.

But still, reducing the SLOCs must be the major goal of any language. Reduce complexity, and the best measure for complexity is still to this day the number of lines of code. Given any certain task, the solution that requires less code is almost always easier to understand.


The holy grail CSS

April 4, 2007 at 08:55

Elliotte Rusty Harold writes about a slightly modified version of the "Holy Grail" CSS layout on his weblog.

Something that really annoys me with this sort of CSS layout is the need to specify actual widths; be it in percentage or in ems or whatever. What I really want is a CSS layout that has a left and/or right column, and the columns extend to the size they actually need, and the center div shrinks accordingly, maybe in some min/max boundary. This very useful behaviour that HTML tables had seems to be simply impossible with current CSS standards.

It's surprising how much unintuitive, hackish CSS is needed for such stuff, even without the IE hacks, just regular CSS. If one needs to go to these lengths just to get some really basic stuff everyone needs, maybe the spec is simply not that good? I always struggle with CSS, and the number of web pages with CSS layout templates suggests other people have problems, too.


Packages, Classes, Methods - Scopes?

April 1, 2007 at 16:57

I wonder why there is a distinction between the concepts of packages, classes, and methods. If you think about it a bit, it's all just scopes. You could simply drop the distinctions.

A scope is a basic building block in a programming language that has

  • a parent scope
  • a set of child scopes
  • a set of static fields
  • a set of non-static fields

Scopes can be instantiated - in the case of packages, it's loading the package. In the case of classes, it's actually instantiating the class. In the case of methods, it's calling the method. Instantiating a scope will initialize the non-static fields to something and yield a scope instance - the activated package, an object instance or the function's stack frame/closure. The scope instance can then be passed around and members of it can be called.

To make it look nicer, you would probably end up providing some syntactic sugar like how function calls return their 'return value' member by default or something. But there doesn't really seem to be a big conceptual difference here. Packages are currently not used or implemented in this instantiation way (or are they? OSGi?), but I think this might be really beneficial...

The idea probably doesn't hold up, but whatever, it sounded nice at first :-)
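
To make the idea a bit more concrete, here is a toy model in Ruby. Everything in it is made up for illustration (the class names, the lookup rule, the example data); in particular, resolving a missing member through the static fields of the parent scopes is an assumption, not something the description above prescribes.

```ruby
# A scope has a parent, child scopes, static fields and the names of its non-static fields.
class Scope
  attr_reader :name, :parent, :children, :static_fields

  def initialize(name, parent = nil, static_fields = {}, field_names = [])
    @name = name
    @parent = parent
    @children = []
    @static_fields = static_fields
    @field_names = field_names
    parent.children << self if parent
  end

  # "Instantiating" a scope: loading a package, creating an object, calling a method.
  # Non-static fields are initialized from the given values (nil if missing).
  def instantiate(values = {})
    fields = @field_names.to_h { |field| [field, values[field]] }
    ScopeInstance.new(self, fields)
  end
end

class ScopeInstance
  attr_reader :scope, :fields

  def initialize(scope, fields)
    @scope = scope
    @fields = fields
  end

  # Assumed lookup rule: own fields first, then static fields up the parent chain.
  def lookup(field)
    return @fields[field] if @fields.key?(field)
    current = @scope
    while current
      return current.static_fields[field] if current.static_fields.key?(field)
      current = current.parent
    end
    nil
  end
end

package = Scope.new('geometry', nil, { pi: 3.14159 })  # plays the role of a package
klass   = Scope.new('Circle', package, {}, [:radius])  # plays the role of a class
circle  = klass.instantiate(radius: 2)                 # plays the role of an object
puts circle.lookup(:radius)
puts circle.lookup(:pi)
```

A method would just be a third level: a child scope of Circle whose instance is the stack frame, with a 'return value' field as one of its non-static members.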