Martin Probst's weblog

Server move

Sunday, September 6, 2009 7:16 PM — 0 comments

As you might know, I used to run this weblog on a virtual server hosted by Hosteurope, in the smallest possible configuration. However, a small virtual server doesn't seem to be enough for even the smallest weblog possible (at least when it's written in Rails...), so I moved this weblog to my own server today.

The trick is that I got a DynDNS domain and point my real domain (martin-probst.com) to it through a CNAME record, and the media server / TV in my living room happily serves the files.

$ dig www.martin-probst.com
[... snip ...]
;; ANSWER SECTION:
www.martin-probst.com.	81215	IN	CNAME	martinpr.homeip.net.
martinpr.homeip.net.	60	IN	A	92.225.50.126

The server is a new Mac mini, so it will certainly not have the dreaded out-of-memory problems. I think I'm even saving money - Apple claims an idle energy consumption of about 14 watts, which should make it slightly cheaper than my server hosting in total. Of course this calculation doesn't include the hardware, but I wanted that media server anyway ;-)

In the process I also upgraded Rails to 2.3.4, which was a bit painful. But I came from 1.2.something, so some friction was probably to be expected.

Hardware on Ubuntu, once again

Wednesday, March 11, 2009 5:54 PM — 0 comments

This is really getting ridiculous. Today I wanted to scan some documents, and after some googling and searching I found out that the old & crappy USB scanner I have here (Mustek Bearpaw 1200 CU) doesn't work on Mac OS X, theoretically works on Windows XP (but the driver is so bad it crashes the OS all the time), and is trivial to get working on Ubuntu.

I still remember the times when hardware support on Linux was really bad, and getting your Wifi to work was a matter of luck. For Wifi, I hear it's still not totally easy, but my experience is that Windows is no better on that front...

Mobile phone contracts

Tuesday, February 17, 2009 8:55 PM — 1 comments

Recently, I changed my mobile phone provider from O2 to Simyo. It's quite funny - the regular, contract-based mobile phone providers ought to be delivering a premium service, given that you pay them a monthly fee and bind yourself to what is commonly a two-year contract. And yet it's quite the opposite. With Simyo, I can now actually understand my bills, they have web tools that are actually useful, and I'm paying a lot less. O2 and the other providers appear to be investing the premium money mostly into commercials and sales (all these mobile phone shops in the towns must be really expensive...).

To me, usable web tools and understandable bills are a major feature in providers of anything, even at a potential slight premium. The complete failure of most phone-related companies at this is really a shame. I would actually happily switch my fixed line provider (Alice) for another one, if I just knew a German telephone company that was actually any better.

Shell meta programming

Monday, December 8, 2008 11:08 AM — 4 comments

I'm currently reworking X-Hive/DB's command line startup scripts for various utilities, and I'm facing an interesting challenge with shell programming.

The issue is that I want to have a ".xhiverc" file that contains various settings in a Java property file style. Normally, I would simply read those settings from within Java, and everything would be nice and fine. But this file is supposed to contain, among other things, the memory settings for the virtual machine - and once the JVM is running, it's of course too late to read those.

So I need to somehow read the file from the shell. That should be easy, right? ". ~/.xhiverc" and everything is fine - or maybe not. What if the user wants to override those settings from the environment? E.g., we have XHIVE_MAX_MEMORY defined in the .xhiverc, but the user has exported XHIVE_MAX_MEMORY="2G". This is where the meta programming comes in: we have to operate on variables whose names we don't know statically.
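
To make this concrete, here is a minimal sketch of the situation (XHIVE_BOOTSTRAP is just a made-up example name, and the file is simplified):

  # ~/.xhiverc - settings in Java property file style
  XHIVE_MAX_MEMORY=512M
  XHIVE_BOOTSTRAP=/var/xhive/bootstrap

  # meanwhile, in the user's shell:
  export XHIVE_MAX_MEMORY="2G"

  # the naive startup script silently clobbers the user's 2G setting:
  . ~/.xhiverc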

Current solution: iterate through all legal variable names, save their state in ${VARNAME}_BACKUP, source the .xhiverc, and then re-set them to the previous value if they were non-empty. As the scripts need to be POSIX compliant (i.e., no bashisms), we don't have ${!VARNAME}, so this already involves some interesting eval scripts (eval export ${var}_BACKUP=\\"\\$${var}\\" - the backslashes are not a Wordpressian/PHP escaping problem).
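
Spelled out, the backup/source/restore dance looks roughly like this - just a sketch, with the list of variables shortened and XHIVE_BOOTSTRAP again being a made-up name:

  # the settings the .xhiverc may define (shortened for illustration)
  XHIVE_VARS="XHIVE_MAX_MEMORY XHIVE_BOOTSTRAP"

  # 1. back up whatever the user has exported
  for var in $XHIVE_VARS; do
    eval export ${var}_BACKUP=\"\$${var}\"
  done

  # 2. read the defaults
  . ~/.xhiverc

  # 3. restore any value the user had set - the environment wins
  for var in $XHIVE_VARS; do
    eval backup=\"\$${var}_BACKUP\"
    if [ -n "$backup" ]; then
      eval export ${var}=\"\$backup\"
    fi
  done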

Now the next interesting thing: how do you test whether a variable is set at all? Testing whether it's non-empty is [ -n "${VARNAME}" ], but what if someone wants to override a default setting to be undefined? If you know the name, it's [ "${XHIVE_MAX_MEMORY+x}" = "x" ]. If you don't, it's again some horrible eval combination - maybe I'm missing it, but there doesn't seem to be a standard "defined" command/test.
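
Such a horrible eval combination might look like this - only a sketch, and maybe there is a nicer POSIX way that I'm missing:

  # true (exit status 0) if the variable named in $1 is set, even if empty
  is_defined() {
    eval "[ \"\${$1+x}\" = \"x\" ]"
  }

  if is_defined XHIVE_MAX_MEMORY; then
    echo "XHIVE_MAX_MEMORY is set (possibly to the empty string)"
  fi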

I have the feeling I'm doing something wrong - this should be easier (tm). Maybe I should just forget about the whole thing, and have an XHIVE_DEFAULT_MAX_MEMORY and a second XHIVE_MAX_MEMORY, and the same for the other variables...
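
That simpler alternative would look roughly like this (again just a sketch, with a made-up default value):

  # default shipped with the scripts
  XHIVE_DEFAULT_MAX_MEMORY=512M

  # an exported XHIVE_MAX_MEMORY wins, otherwise fall back to the default;
  # using ${...-...} instead of ${...:-...} would even preserve a value that
  # the user deliberately set to the empty string
  XHIVE_MAX_MEMORY="${XHIVE_MAX_MEMORY:-$XHIVE_DEFAULT_MAX_MEMORY}"
  export XHIVE_MAX_MEMORY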

What also surprised me along the way: this of course has to work in Windows batch, too. And everyone knows that Windows batch is probably one of the most horrible programming environments ever "invented". But this particular problem is actually not too difficult. Once someone on StackOverflow.com enlightened me about the byzantine details of the Windows batch FOR loop, it's a relatively simple loop containing an IF DEFINED %%i:

  FOR /F "eol=# tokens=1,2* delims==" %%i in ('type "!XHIVERC!"') do (
    REM only set variables if not already defined as environment variables (they take precedence)
    IF NOT DEFINED %%i (
      SET %%i=%%~j
    )
  )

SSD is the new disk, disk is the new tape

Friday, November 21, 2008 9:34 AM — 0 comments

Tim Bray has some very interesting performance numbers for storage systems.

There is this saying that memory is the new disk, disk is the new tape. I think we have to insert something there - SSD is the new disk, disk is the new tape, and memory is somewhere between the CPU cache and the SSD.

The problem, then, is how to benefit from these enhancements. If you have ye olde database system, you could simply put all of the data on SSD. This will be fast, but quite a bit of a waste. DBMSes currently manage the cache hierarchy on their own, having a memory cache for the really hot data, disk storage for the not-so-hot, and tapes for backups.

It would be really nice if the DBMS were aware of the wildly different seek times of SSDs and disks, and if it could thus manage this aspect of the storage hierarchy, too. Ideally, it would lazily remember which data was accessed recently, and move the old stuff to disk. For example, in everyone's favorite running performance example - called "Twitter" - presumably next to no one cares about tweets that are older than a month or so, so you could move them to disk, the new tape.

This is again a good example of a change in requirements for databases that, as things stand, requires developers to implement the smarts themselves. Let's hope databases will learn this...

Java & Ruby complexity

Wednesday, November 19, 2008 7:33 AM — 0 comments

Patrick Mueller writes:

Same sort of nutso thinking with Java. A potentially decent systems-level programming language, it could have been a successor to C and C++ had things worked out a bit differently. But as an application programming language? Sure, some people can do it. But there's a level of complexity there, over and above application programming languages we've used in the past - COBOL and BASIC, for instance - that really renders it unsuitable for a large potential segment of the programmer market. [...] We're seeing an upswing in alternative languages where Java used to be king: Ruby, Python, Groovy, etc

I really don't agree with the notion of complexity in Java. Complexity as a term is IMHO highly imprecise, so maybe we're just thinking differently about it here.

Much of the stuff people don't like about Java is actually its verbosity (compared to, e.g., Ruby), but that's nearly the opposite of complexity. The inventors of Java explicitly left a lot of features out - like closures - because they feared they would create a too complex programming language.

Ruby & Co have all these features, plus a lot of nice meta programming, and a somewhat weird module/inclusion/inheritance system. I personally think that Ruby is much more complex than Java in the long term. The interesting question is whether people will be happy with the added complexity in the long term.

I see this as a trade-off in programming languages: language features like cool meta programming, closures, or a really worked-out type system (à la OCaml & Haskell) can remove a lot of accidental complexity: with them, you're able to write programs much more succinctly, or have proofs of global properties of your program that weren't possible before.

On the other hand, language features can create a lot of complexity, if not done really well. I'm reading the Scala mailing list, and I remember discussions of the sort "is this code legal Scala? and if it is, what does it mean?" (usually from a type system point of view), and if I remember correctly the language designers weren't quite sure about it either. This is exactly what you don't want in a language: unclarity or ambiguity of expressions, unexpected "side effects" of expressions.

Quite a lot of the Ruby/Rails code one happens to see is clever in very interesting ways. But I really see that cleverness as a problem: who will understand the tricks that made the code a bit shorter in five years? Probably someone, but it might take them a long time. Even now it's sometimes quite difficult to find documentation on a particular library method/class in Ruby, as the documentation system is apparently not up to handling the language's module inclusion features.

At what point do all these clever tricks add up to something that is no longer understandable? Are we really sure that the modularization works out well enough that we don't have to be afraid of it all ending up as a large meta-closure-soup? ;-)

Don't get me wrong: I like dynamic languages for a lot of their features. I'm just wary of some of the effects. Pushing accidental complexity out of the application and into the programming language (now as feature complexity) should normally be a good thing: it sounds reasonable that this should reduce overall complexity and give programmers a broader understanding of what's happening. But we need a good modularity system and proper abstractions to get a real positive effect from this - and I'm not sure I see that in, e.g., Ruby.

Databases and Caching

Tuesday, November 11, 2008 6:24 PM — 1 comments

Dare Obasanjo compares database caching with how compilers manage the various CPU caches (e.g., L1, L2). Surprisingly, he comes to the conclusion that you need to implement your own caching scheme through memcached and friends because in the database situation the amount of data is so large:

The biggest problem is hardware limitations. A database server will typically have twenty to fifty times more hard drive storage capacity than it has memory. Optimistically this means a database server can cache about 5% to 10% of its entire data in memory before having to go to disk. [...] So the problem isn't a lack of transparent caching functionality in relational databases today. The problem is the significant differences in the storage and I/O capacity of memory versus disk in situations where a large percentage of the data set needs to be retrieved regularly.

I wonder how this is different from the size of the L2 cache compared to main memory?

It's even worse: on a regular PC you might have 4 MB of L2 cache, but 4 GB of main memory. That's about 0.1% of main memory - so databases, comparing only relative data sizes, are actually in a relatively luxurious position.

Application knowledge

Quite the contrary, I believe the problem is not the data sizes, but the optimization hints available to databases (and potentially the smartness of the database caching methods). A good compiler, in particular a JIT, can easily judge what data the code will use in the near future; through data flow analysis and fancy register allocation tricks, a compiler has pretty complete knowledge of what the code tries to do, so it can optimize cache usage (and a whole lot of other things) very efficiently.

Compared to this, databases have little knowledge of the data access patterns of the application. They can only rebuild this knowledge from observations of the queries hitting them, and they don't seem to be very successful there, judging from how often you hear people talking about memcached. I'm not sure why that is exactly - maybe because it's always harder to implement optimizations based on observations of dynamic behaviour than on static knowledge?

One problem is probably that the database doesn't necessarily know - and in many situations probably cannot even guess - which data structures are commonly displayed together on a certain webpage, so that caches could be invalidated together, or the data even stored together.

It might be an interesting thought experiment to consider what would be possible if the database were integrated with the application logic in a way that made the application knowledge available to the database. This could probably lead to interesting changes regarding invalidation and cache organization. I know there were (and probably still are) some things going in this direction in the Smalltalk environment, but I have no idea if they really take advantage of the application knowledge. Probably not, as most Smalltalk is highly dynamic, and I don't think they've put the emphasis on declarative programming that would be needed for this.

Cache Granularity

Also interesting is the fact that the granularity of objects in a relational database is quite different from the application perspective. Relational databases store entities as rows in tables, but (web) applications have a hugely different data model. A single application entity, e.g., a user, will span several tables. But the pages the database uses as the unit of caching usually only contain data from one table. If you have 16k of cache, that might be enough memory for several hot data entities, but because the database caches tables, not application entities, much of those 16k will be filled with rarely used rows - e.g., 4k from the users table, 4k from the mood messages table, 4k from the friends table, and so on. Application developers fight this (and the processing cost of join operations) with denormalization, which is basically a hack to reduce the number of tables an entity spans.

This all boils down to the fact that relational databases were designed for mass data processing, as in financial institutions, where large calculations over huge tables of uniform data with little nested structure are the common operation.

I think this is one of the areas where non-relational databases, like XML databases, are going to have a bright future. The data model, and thus the unit of caching, is much closer to what today's content-centric applications' data actually looks like. It's not only much easier to program without that impedance mismatch, it can also have significant performance advantages over RDBMSes.

EMC += Martin Probst

Tuesday, November 4, 2008 10:20 AM — 1 comments

Some of you already know it: I've joined EMC Corporation, starting on Oct 15th. I'll be teleworking from Potsdam but commuting every month to Rotterdam for one week, which is an ideal arrangement for me.

EMC acquired X-Hive, my former employer in the Netherlands, in the summer of last year, so I'm basically returning to the company I left some time ago, mostly to finish my studies and try some stuff out (like taking a look into SAP's corporate innards, and doing some freelance work).

I'm very happy that this worked out. X-Hive used to be a great employer, with really nice people and very interesting work. And now they suddenly pay much better ;-). Seriously: from my first days, it seems as if the influence of EMC is very good. Of course there is some corporate bureaucracy (which so far seems very acceptable), but on the other hand there are some seriously smart people giving input into my favorite native XML database. There is a huge set of really cool requirements to match, and X-Hive/EMC now certainly has the resources to fulfill them.

I'm very happy to work on this really cool product once again, and I'm particularly happy that it is probably going to have a much larger impact very soon. Nice times.

Time Machine works

Tuesday, November 4, 2008 6:26 AM — 0 comments

I'm happy to report that I tested Time Machine's backup/restore capability yesterday evening, and it works - sigh of relief.

I brought my MacBook Pro (1st gen.) in for repairs because the left fan failed again, and I also had them put in a bigger hard drive. The restore from Time Machine took about 2.5 hours, maybe a bit more, but afterwards you boot directly into your complete system. Very nice!

On the fan: Apple decided to fix it under warranty, as the very same fan had failed a bit less than two years ago. Also, not long before that, the right fan failed. So Apple, you're building a computer that sells for more than 2000 €, and you cannot build/buy/ship fans that last more than a year?

Update: after booting into Mac OS, you'll have to re-import your Mail.app emails, and the next Time Machine run took ages, at least for me. My machine crashed twice while importing the mails before I increased the fan speed - apparently there is still a heat problem. I also get some weird graphics errors (violet areas in windows, horizontal lines). I think I will get to know some more people at the Apple hotline shortly ...

Trying to control the heat issue, I switched from smcFanControl to Fan Control. smcFanControl only allows a user to set specific fan speeds (via presets), whereas Fan Control dynamically adjusts fan speed depending on the current temperature (which in turn depends on work load). So if you have a large job running, it will dynamically increase the fan speed a bit more than Mac OS would, to keep your computer a bit cooler. Nice.

.NET VM sizes and Java

Wednesday, October 22, 2008 3:39 PM — 0 comments

I just installed all the new Windows XP patches in my virtual machine running under Parallels, including a bunch of hotfixes and service packs for Microsoft .NET.

As the download took quite a while, I got curious and checked the size of the various .NET frameworks after installation. According to the Add/Remove Programs dialog (I have no idea where these files really end up, so I can't check directly), .NET 2.0 and 3.0 consume ~280 MB and ~335 MB respectively, including the German language packs at ~100 MB each. For the also-installed .NET 1.1 there is no size given, but I'd guess it's not that much smaller than .NET 2.0, so about 150 MB should be a conservative guess.

So in summary, to run Microsoft .NET programs, I spend probably well above 750 MB of hard drive space. Compare that to Java, where one version suffices due to backwards compatibility, and where the JRE is 114 MB (the JDK is larger, but not by much, if I remember correctly). This is actually a pretty good argument for at least some investment in backwards compatibility.

I wonder what they include in those .NET packs? And what if they continue "innovation" (or whatever) at this pace? In the 6 years since 2002 there have been 3 incompatible versions - does that mean we'll have to install 2 GB across 6 frameworks in another 6 years?