My friend Randy posted a few days ago on Grid, Cloud, HPC … What’s the Diff?. I started to make a comment on the blog, but it was getting too long so I moved it here.

Randy does a good job of pinning down both performance and scalability, but in my experience productivity trumps both. This is another way of saying that there are sometimes smarter ways of reaching an outcome than brute force. There’s a DARPA initiative spun up around this – High Productivity Computing Systems, which I think came about when somebody looked at what Moore’s law implied for the energy and cooling characteristics of a Humvee full of C4I kit.

A crucial point when considering productivity is to look at the overall system (which can be a lot more than what’s in the data centre). Whilst it may be possible to squeeze the work done by hundreds of commodity machines onto a single FPGA, that has implications for development and maintenance time and effort. In a banking environment where the quants that develop this stuff are far from cheap, it may actually be worth throwing a few $m at servers and electricity rather than hurting developer productivity.

I’d argue that the lines between message passing interface (MPI) workloads and embarrassingly parallel problems (EPP) are blurrier than Randy makes out. It’s all about the ratio of compute to data, and dependency has a lot of importance – if the outcome of the next calculation depends on the results of an earlier one then you can end up shovelling a lot of data around (and high speed, low latency interconnect might be vital). On the other hand if the results of the next calculation are independent of previous results then there’s less data to be wrangled. Monte Carlo simulation, which is used a lot in finance, tends to have less data dependency than other types of algorithm.
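To make the low-dependency case concrete, here is a minimal sketch (not taken from Randy’s post) of a Monte Carlo estimate of pi split into independent batches. The function names, seeds and batch sizes are all assumptions chosen for illustration. The point is only that each batch takes a seed and a sample count going in and sends a single integer back, so there is next to nothing to shovel between workers.

```python
import random

def run_batch(seed, samples):
    """Count random points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)  # each batch has its own generator, so no shared state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(seeds, samples_per_batch):
    """Combine independent batches; only one integer per batch comes back."""
    seeds = list(seeds)
    total_hits = sum(run_batch(seed, samples_per_batch) for seed in seeds)
    total_samples = len(seeds) * samples_per_batch
    return 4.0 * total_hits / total_samples

if __name__ == "__main__":
    # Hypothetical workload: 8 batches that could each run on a different machine
    print(estimate_pi(seeds=range(8), samples_per_batch=5000))
```

The batches here run one after another, but because none of them reads another’s results they could be farmed out to separate machines without changing the answer.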

Most ‘EPP’ are low in data dependency (usually with just initial input variables and an output result), and so the systems are designed to be stateless (in an effort to keep them simple). This causes a duty cycle effect where some time is spent loading data versus the time spent working on the data to produce a result (and if the result set is large there may also be dead time spent moving that around). Duty cycles can be improved in many cases by moving to a more stateful architecture, where input data is passed by reference rather than value and cached (so if some input data is there already from a previous calculation it can be reused immediately rather than hauled across the network). This is what ‘data grid’ is all about.
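The duty cycle effect can be put into numbers with a back-of-envelope model. This is a sketch under simplifying assumptions: the function names and timings are made up, and a cache hit is treated as costing nothing, which flatters the stateful case a little.

```python
def duty_cycle(load_s, compute_s, unload_s=0.0):
    """Fraction of wall-clock time spent on useful computation."""
    total = load_s + compute_s + unload_s
    return compute_s / total

def cached_duty_cycle(load_s, compute_s, hit_rate, unload_s=0.0):
    """Duty cycle when a fraction hit_rate of the input is already held locally."""
    effective_load = load_s * (1.0 - hit_rate)  # assumption: cache hits are free
    return duty_cycle(effective_load, compute_s, unload_s)

if __name__ == "__main__":
    # Hypothetical task: 4s hauling inputs over the network, 6s computing
    print(duty_cycle(4.0, 6.0))               # stateless
    print(cached_duty_cycle(4.0, 6.0, 0.75))  # 75% of the input already cached
```

With those illustrative numbers the stateless worker spends 60% of its time computing, and the cached one about 86%.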

Getting back to the question of ‘grid’ versus ‘cloud’, I agree with Randy that there’s a big overlap, and it’s encouraging to see services like Amazon’s Cluster Compute Instance (and its new cousin the Cluster GPU Instance). I will however return to a point I’ve made before – Cloud works despite the network, not because of it. The ‘thin straw’ between any data and the cloud capability that may work on it remains painful – making the duty cycle worse than it might otherwise be. For that reason I expect that even those with ‘EPP’ type problems will have to think more carefully about the question of stateful versus stateless than they may have done before. It matters a lot for the overall productivity of many of these systems.


It seems that our politicians are easily fooled by the telcos and their regulatory capture. Just yesterday the UK’s Culture minister Ed Vaizey announced his support for a ‘two-speed‘ internet. The idea is superficially attractive – content providers pay a premium to have their stuff delivered faster, and the consumer benefits from improved service. It’s like the company you buy petrol from also paying your road tolls.

The problem – there is no faster. There is only slower.

This isn’t about BT etc. building the online equivalent of the M6 Toll. This is about BT etc. building the online equivalent of the M4 bus lane.

For sure it would be nice if somebody was building the extra physical infrastructure to bring me a faster internet. But that’s not what’s happening here. The UK’s ambitions are still set desperately low at providing 2Mbps services for all, and now our politicians want to allow the open part of that to be slowed down even more.

Let’s also figure out who pays, and for what… ‘Heavy bandwidth’ services (anything that distributes video) are singled out as ‘most likely to be hit with higher charges’. These services already pay for big fat pipes, and it’s fair to ask why should they pay again? It’s also fair to ask who does the paying? With YouTube and ITV.com charges could be passed back to advertisers, but I fail to see the win here. With BBC iPlayer it would seem that there’s an expectation that a part of the TV license fee should be used as a telco subsidy (having failed to get a ‘broadband tax’ into the last finance bill). Nice money if you can get it. I wonder how much BT spends on lobbying, and how many fibre to the home roll-outs that would buy?


My Kindle died two days ago :(

Thankfully Amazon were pretty prompt in dealing with this. I called their Kindle helpline[1], a new one was sent out, and it arrived this morning :)

It didn’t take long for me to restore all of my content, and things like bookmarks seem to be intact :)

Unfortunately there doesn’t seem to be any way to backup/restore collections. I’d only had my original Kindle for a few weeks – enough time to accumulate 7 pages worth of content, which I’d organised into collections that neatly fitted onto the home page. Sadly those collections were gone with my old Kindle, and it took me about 20 minutes of repetitive clicking and cursoring to get things back how I like them. I dread to think what that process would be like if I’d been an early adopter and spent years curating content into collections. Amazon – this needs fixing.

After a bit of searching around I did find Adrian Colegate’s Kindle Collection Manager, which would probably have been handy in speeding up the process of recreating my collections. Sadly it doesn’t yet have the ability to backup/restore collections (and it also doesn’t seem to handle audiobooks and pictures [or mp3s?]).

[1] Who were quick, courteous, and didn’t make me jump through loads of hoops before determining that a warranty replacement was needed.


I’ve been meaning to try this out for some time, and my recent trials with Amazon’s US Kindle Store prodded me into action.

The theory

Popular SSH clients (like OpenSSH and Putty) have features that allow tunneling. Amazon Machine Images (AMIs) normally have some way of being administered over SSH. So… start up an EC2 instance where you want a tunnel end point (in my case the US), SSH into it, and let some web traffic hitch a ride.

The practice

Finding an AMI

There are LOTS of AMIs out there, and the requirement for this is nothing more than an SSH daemon. For the purpose of my test I used one of my old AMIs that I made with CohesiveFT Elastic Server.

Make sure that SSH ports are open

The security group that you will start your machine in should be reachable on port 22 from your IP. This can be set in the AWS console or using ec2-authorize (ec2auth for short):

ec2auth securitygroup -P tcp -p 22 -s your_physical_machine_IP/32

Start your AMI

From the AWS console or EC2 API Tools (or whatever your preferred higher level tool is).

SSH into your EC2 instance

For this you will need an instance name (like ec2-50-16-32-41.compute-1.amazonaws.com) or its IP (50.16.32.41), a username and password (and in some cases you may also need a correctly configured private key). There are two methods that I’ll illustrate (and I’ll use 9999 as the proxy port, but you could use anything – 8000, 8080 and 8888 are other popular choices).

Using OpenSSH

On Windows you’ll need Cygwin to get OpenSSH. On most Linux distros it should be there already.

ssh -D 9999 username@instance_name

Using Putty

Put the instance name in:

Then add an SSH Tunnel before clicking Open:

Once you sign in you should have a working tunnel. Now to use it.
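If you want a quick sanity check before touching any browser settings, something like the sketch below will confirm that the local end of the tunnel is accepting connections. The helper name port_is_open is made up for this example, and it only shows that something is listening on the port, not that it is definitely the SSH SOCKS proxy.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # 9999 is the proxy port used in the examples above
    print(port_is_open("localhost", 9999))
```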

Configuring Firefox to use the tunnel

I use Chrome as my main browser, and tend to use Firefox for experimental stuff like this.

To set a proxy select Tools -> Options then Advanced, Network, Settings:

Once there you want to select Manual proxy configuration and define a SOCKS Host of localhost:9999 with No Proxy for localhost, 127.0.0.1:

That’s it :)

When you OK back to the browser you should now be connected to the web via the EC2 instance. Don’t forget to shut it down when you’re done and revert Firefox back to ‘No proxy‘.

NB I’ve read that sites within the Amazon cloud (like the AWS site itself) may not behave as expected when your source IP is within EC2. I can say for sure that it works fine with the Amazon Kindle store (though as soon as I sign in I get pointed back to the .co.uk site)

NB2 I’m well aware that there are other means/services for doing this. EC2 is probably optimal for short term low bandwidth usage. Other approaches (see below) may be better if you want to move a lot of material around over a longer period of time.

Todo

This is fine for browsing and other stuff that lets you define a SOCKS proxy. There are circumstances where that’s not enough, and I’ll post some other day on how something like OpenVPN might be used to do this (or you could read @samj’s version)

Updates

Update 1 – 13 Dec 2010 – I got sick of fiddling with Proxy settings in Firefox, so I’ve now started using Wyzo for all of my proxied browsing. It’s pretty much a skinned Firefox, so the user experience is the same, but it saves me much clicking.

Update 2 – 14 Dec 2010 – Wyzo set off my anti malware software (I use Prevx). I’m pretty certain that it’s a false positive, but when you list out the things that a browser does to a machine it looks pretty evil (and I am disappointed that it made itself default browser without asking – that’s just rude, but also easy to fix):

WYZO.EXE has been seen to perform the following behavior:

  • The Process is packed and/or encrypted using a software packing process
  • This Process Deletes Other Processes From Disk
  • Can communicate with other computer systems using HTTP protocols
  • Executes a Process
  • Creates a TCP port which listens and is available for communication initiated by other computers
  • Registers a Dynamic Link Library File
  • This process creates other processes on disk
  • Can make outbound communication to other computers, IM chat rooms and other services using IRC protocols
  • Communicates with other computers using FTP connections
  • Executes Processes stored in Temporary Folders
  • Adds products to the system registry
  • Writes to another Process’s Virtual Memory (Process Hijacking)
  • Adds an ActiveX component changing or modifying the function of your browser
  • Deletes an ActiveX component

WYZO.EXE has been the subject of the following behavior:

  • Created as a process on disk
  • Executed as a Process
  • Terminated as a Process
  • Has code inserted into its Virtual Memory space by other programs
  • Deleted as a process from disk
  • Executed by Internet Explorer
  • Changes to the file command map within the registry

Update 3 – 14 Dec 2010 – I should also mention that I’ve had issues with PeerBlock getting in the way of connections to EC2. For some reason somebody involved in maintaining the IP blacklist has decided that one of the Amazon ranges is ev1L. Now that I’m aware of the issue it’s simple to put an exception in place and restart my Putty session.

Update 4 – 30 Jun 2011 – After a few machine rebuilds and the discovery of Firefox profiles I’ve stopped using Wyzo. I’ve used ‘firefox.exe -ProfileManager’ to create a new profile called Proxy, and then set that profile up to use my proxy. Then I created a shortcut to Firefox with ‘-p Proxy’ appended (and pinned it to the taskbar).


As I’ve spent more time with my Kindle I’ve been paying more attention to eBook prices. My conclusion is that zero weight (or perhaps just the novelty of eBooks) is a feature, and one that the supply chain thinks is worth a premium.

Pricing

Typical pricing for eBooks seems to be the same as (or just below) the price of a hardcover book for a new release, and similarly just below the cost of the paperback when that comes along a little later. There are some egregious exceptions, but that’s the typical pattern. It’s a pattern that makes intuitive sense to most buyers, as eBooks don’t incur physical printing and shipping costs (though they may attract taxes such as VAT that don’t apply to dead tree books[1]).

Can’t buy used

What I can’t do with an eBook is buy a used one, which is a shame, as presumably condition would always be ‘excellent’ – no scuffed corners, yellowed edges or coffee stains. There’s also no effective grey/discount market for ‘new’ ebooks – DRM and tight control of the channel put paid to that. This has some important implications for the long tail, which are probably best illustrated by an example. I’ll use Charles Stross’s Halting State to illustrate this one:

The Kindle edition seems good value against the paperback at 22p cheaper, but by going the dead tree route I could have a new paperback from elsewhere for £3.30 (almost a pound less than the eBook) or a used one in ‘Very Good’ condition for £2.76 (a saving of £1.51 [35%] against the eBook). So if all I want to do is read the words then there are clearly cheaper routes than the eBook.
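The figures in that paragraph hang together, which can be checked with a few lines of arithmetic. The eBook and paperback prices below are derived from the quoted savings rather than stated directly, so treat them as implied values; the variable names are just for illustration.

```python
# Figures quoted in the paragraph above, in pounds
new_elsewhere = 3.30    # new paperback from another seller
used_very_good = 2.76   # used copy in 'Very Good' condition
used_saving = 1.51      # stated saving of the used copy against the eBook
ebook_discount = 0.22   # the eBook is 22p cheaper than the paperback

# Prices implied by those figures (derived here, not quoted in the post)
ebook = round(used_very_good + used_saving, 2)
paperback = round(ebook + ebook_discount, 2)

print("eBook: %.2f, paperback: %.2f" % (ebook, paperback))
print("New elsewhere saves %.2f" % (ebook - new_elsewhere))
print("Used saves %.0f%%" % (100 * used_saving / ebook))
```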

But I have a fancy new toy in my bag, and I don’t want to lug around a space gobbling weighty chunk of dead tree, so maybe I will swallow that eBook premium (though not this time – I have enough books that I haven’t read, so I’m not in the market for eBooks of stuff that I’ve read already[2] – I was only even looking at Halting State as I was recommending it to a friend).

An international detour

Something that I’m starting to find quite annoying is the opacity of eBook pricing from country to country. If I browse the Amazon.com site (from the UK) I can’t see what an eBook costs in the US, I just get:

and:

So I have price transparency for many varieties of paper versions, but not the eBook (and it seems that the UK price is OK for a $1.5/£1 FX rate).

I sent the US link to my friend (in the US) and he gets:

and:



So the price for the Kindle edition that I can’t see, and he can’t buy is $6.99 (~£4.67 – a little more than the UK version). Confused? I am.

Amazon say that it will (soon) become possible to buy stuff for Kindle on people’s wish lists, so maybe these issues will go away, but I’m not holding my breath.

Back on topic

Which brings me back to my original point. I want stuff on my Kindle (and want other people to be able to buy wish list stuff for my Kindle) because I think zero weight probably is a useful feature.

Notes

[1] A prize for anybody who can offer a rational explanation why pure information in digital form is taxable, whilst information pressed onto smashed up dead trees isn’t?
[2] There’s an exception here for reference books, which typically aren’t read cover to cover in the first place. I recently bought the eBook versions of pretty much everything I’ve ever owned from O’Reilly that’s made it to the new format (using their $4.99 eBook upgrade offer).


Document management sucks! There – I said it. I challenge you to prove me wrong.

I haven’t yet found a document management system (DMS) that’s fit for purpose, and I think I know why.

It’s not about the technology. Documentum might hark from the client server era, and Alfresco trumps that with its SOA, but these are implementation details that matter not to the user.

Paper is two dimensional

It’s about the metaphor. Specifically the dimensionality of the metaphor. Pieces of paper are 2D, and so are document management systems. This makes sense in the physical world. I can only put my piece of paper in one place, which I might call a file or a folder or whatever. Computer file systems copy this metaphor, and document management systems copy it again – they just call a folder a workspace to make it sound more collaborative.

The real world is multidimensional

And we have IT abstractions that can be too. When I write an engagement letter to a client I shouldn’t be forced to ‘file’ it in a ‘folder’ called ‘engagement letter’ or ‘clientX’. I should be able to give it multiple attributes (tags), and virtual ‘folders’ can be assembled from those attributes. Thus if I want to see a library of engagement letters I select that tag, if I want to see all of the stuff relating to clientX I choose their tag.
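As a rough illustration of how little machinery this needs, here is a minimal sketch of a tag index where a ‘folder’ is nothing more than a query. The class name, method names and document names are all invented for the example; a real system would sit on a database or search index rather than an in-memory dictionary.

```python
from collections import defaultdict

class TagIndex:
    """Documents carry any number of tags; 'folders' are just queries over them."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def add(self, doc_name, tags):
        for tag in tags:
            self._docs_by_tag[tag].add(doc_name)

    def folder(self, *tags):
        """Return the documents carrying every one of the given tags."""
        if not tags:
            return set()
        sets = [self._docs_by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*sets)

if __name__ == "__main__":
    index = TagIndex()
    index.add("letter-2010-10.doc", {"engagement letter", "clientX"})
    index.add("nda.doc", {"NDA", "clientX"})
    index.add("letter-2010-11.doc", {"engagement letter", "clientY"})
    print(sorted(index.folder("clientX")))
    print(sorted(index.folder("engagement letter", "clientX")))
```

The same letter shows up in both the clientX view and the engagement letter view without ever being copied or moved.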

‘Oh’, I hear you cry, ‘tagging takes effort, and people will forget’. That’s a real problem, but I think it’s solvable. We stopped trying to categorise everything into some massive (2D) taxonomy a long time ago, and decided that search would fill the gap. Search is good when it works, but it can be disorderly – hence the glib one liner ‘why are you searching when you should be finding’. The right way to do this is to make search part of the process at the front end rather than the back end. Clippy returns – ‘it looks like you’re saving an engagement letter to clientX – shall I tag that “engagement letter” and “clientX” for you?’.

The Social Aspect

We’ve seen great usage of tags already in social bookmarking sites like del.icio.us, so why not bolt on that functionality to document management (after all a ‘web page’ and a ‘document’ are essentially variations of the same thing). This raises the question of whose tags – my tag, your tag, the company tag. But who cares – this is what educated suggestions can help with, particularly when search can identify similarities with other documents. As JP wrote earlier in the week, Social Objects are important; sadly I fear that document management systems are anti-social. A DMS may provide a ‘shared space’ for a document as a social object, but it fails to provide rich support for the other activities that should be taking place. Yes, there’s metadata in a DMS, but not typically the sort of open and collaborative metadata that JP is referring to. This is where I can get excited some more about initiatives like Open Bookmarks (which is best described here). Not only does this potentially solve the state synchronisation issues I was concerned about last month, but it provides the perfect platform for social interactions around documents – a means to provide curation.

Security

This is an area that can’t be ignored. As soon as you get into a conversation about accessing documents you’re quickly into a conversation about preventing access to documents. I think this isn’t as hard a problem as people make out. Identity Management practices can be applied equally to people and the documents they interact with. People have attributes (like ‘Director’ and ‘works in HR’ and ‘based in Switzerland’) and these can be synthesised into abstract roles (like ‘Swiss HR Directors’). Documents also have attributes (tags) that can be used to provide not just logical views as discussed above, but logical groupings for the purpose of security policies (for example all documents relating to ClientX can only be seen by the ClientX project team, except NDAs and engagement letters, which are also visible by legal). The policy here provides a mapping between document views and people roles, and should be fairly self explanatory (and easy to audit).
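The ClientX example can be written down as a handful of rules, each pairing a set of document tags with a set of person attributes. This is only a sketch of the idea: the rule format, the can_view function and the attribute names are assumptions for illustration, not a description of any existing product.

```python
def can_view(person_attrs, doc_tags, policy):
    """Allow access if any rule matches both the document's tags and the person's attributes."""
    for rule in policy:
        if rule["doc_tags"] <= doc_tags and rule["person_attrs"] <= person_attrs:
            return True
    return False

# Hypothetical policy for the ClientX example: the project team sees everything
# tagged clientX; legal additionally sees NDAs and engagement letters.
POLICY = [
    {"doc_tags": {"clientX"}, "person_attrs": {"clientX team"}},
    {"doc_tags": {"clientX", "NDA"}, "person_attrs": {"legal"}},
    {"doc_tags": {"clientX", "engagement letter"}, "person_attrs": {"legal"}},
]

if __name__ == "__main__":
    print(can_view({"legal"}, {"clientX", "NDA"}, POLICY))
    print(can_view({"legal"}, {"clientX", "status report"}, POLICY))
```

Because the whole policy is a short list of tag and attribute pairs, auditing it amounts to reading the list.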

Now can somebody please build this for me?

PS This is yet another reason why we shouldn’t have software patents. Firstly this whole thing is obvious (at least to me), and secondly Xerox have this, but in their usual style appear to have done precious little with it. All that’s achieved here is blocking a startup from implementing (or robbing them blind later for ‘infringement’).


I often find myself having to rename a bunch of media files. This would be easy if it was just a matter of finding ‘foo’ and replacing it with ‘bar’. Sadly though I regularly have a list of numbered ‘foo’s that I want to become differently numbered ‘bar’s (e.g. foo_04 -> bar_01, foo_05 -> bar_02).

Normally I just suck this up and deal with the problem by hand, but as I’ve been trying to learn Python I thought I’d dust off my text editor and knock up a little script.

Here’s the result – incdec.py:

"""
Author: Chris Swan
Date:   5 Oct 2010
Updated with suggestions from Tim Swan: 6 Oct 2010

with thanks to Matt Weber (http://www.mattweber.org/2007/03/04/python-script-renamepy/)

Increment or decrement file names
"""

import os
import sys
from optparse import OptionParser

def ProcessFiles(options):

    # Set the offset to increment or decrement
    if options.inc:
        offset=options.offset
    else:
        offset=0-options.offset

    # Get directory to work on
    if options.directory:
        path = options.directory[0]
    else:
        path = os.getcwd()

    # Create a sorted list of files in the directory (os.listdir order is arbitrary)
    fileList = sorted(os.listdir(path))

    # Reverse the list so that we don't get file name collisions
    if offset > 0:
        fileList.reverse()

    # Iterate across the list of files
    for file in fileList:

        # Get filename and the extension
        name, ext = os.path.splitext(file)
        oldname = os.path.join( path, name+ext )

        # Replace - first step
        if options.replace:
            for vals in options.replace:
                name = name.replace(vals[0], "*")
                replace = vals[1]

        # Extract digits from filename
        ndigits = ''.join([letter for letter in name if letter.isdigit()])
        # and the residual letters
        nletters = ''.join([letter for letter in name if not letter.isdigit()])

        # Process the inc/dec
        if ndigits != '':
            # Apply the offset (a negative offset decrements)
            newdigits = str(int(ndigits)+offset)
            # and restore any leading zeros stripped by the int operation
            newdigits = newdigits.zfill(len(ndigits))
        else:
            newdigits=""

        # Replace - second step
        if options.replace:
            nletters = nletters.replace("*", replace)

        # Create the new name
        newname = os.path.join( path , nletters + newdigits + ext )
        try:
            # Check for verbose output
            if options.verbose:
                print(oldname + " -> " + newname)
            # Rename the file
            os.rename(oldname, newname)
        except (OSError):
            print("Error renaming "+file+" to "+newname, file=sys.stderr)

if __name__ == "__main__":
    """
    Parses command line and renames the files passed in
    """
    # Create the options we want to parse
    usage = "usage: %prog [options]"
    optParser = OptionParser(usage=usage)
    optParser.add_option("-v", "--verbose", action="store_true", dest="verbose", default=False, help="Use verbose output")
    optParser.add_option("-i", "--inc", action="store_true", dest="inc", default=False, help="Increment")
    optParser.add_option("-o", "--offset", type="int", dest="offset", default=1, help="Offset number")
    optParser.add_option("-d", "--dir", action="append", type="string", nargs=1, dest="directory", help="Directory to work on if not PWD")
    optParser.add_option("-r", "--replace", action="append", type="string", nargs=2, dest="replace", help="Replaces OLDVAL with NEWVAL in the filename", metavar="OLDVAL NEWVAL")
    (options, args) = optParser.parse_args()

    # Process files
    ProcessFiles(options)

    # exit successful
    sys.exit(0)

It runs in Python 3 (tested with 3.1.2). By default it will decrement the number on a filename by 1 in the present working directory. There are a bunch of options:

  • -i will increment numbers rather than decrement
  • -o will allow an offset other than 1
  • -d allows a directory other than the present working directory to be used
  • -r allows a text search/replace on the filename (I’ve been careful to allow this to contain numbers on both parts and not get in the way of the renumbering process)
  • -v gives verbose output
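The core of the renumbering can be sketched as a pure function, which makes it easy to try without touching any files. This is a simplified illustration rather than a replacement for the script: the name renumber is made up here, and it skips the -r handling and the file system entirely. Like the script, it gathers every digit in the name and puts the shifted number at the end.

```python
def renumber(name, offset):
    """Shift the number embedded in a file name (without extension) by offset."""
    digits = "".join(ch for ch in name if ch.isdigit())
    letters = "".join(ch for ch in name if not ch.isdigit())
    if not digits:
        return name
    # zfill restores leading zeros lost in the int round trip
    shifted = str(int(digits) + offset).zfill(len(digits))
    return letters + shifted

if __name__ == "__main__":
    for old in ["foo_04", "foo_05", "foo_10"]:
        print(old, "->", renumber(old, -3).replace("foo", "bar"))
```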

Use at your own peril.

Please be gentle in the comments – this is my first real Python program.

Update 6 Oct – I modified some of the code in light of Tim’s comments below (particularly to deal with directory delimiters on different OSes). I also spotted an issue with how I’d handled replace. My original effort will be maintained here for posterity.


Despite knowing that ‘free 3G’ is a trap I still ordered a Kindle 3G. There was nobody home when the postman came yesterday, so I had to pick it up from the depot this morning before rushing for my train (depot opens at 0800, train leaves at 0803, I just made it).

This isn’t intended to be a comprehensive review, as there are plenty of those out there already. Here are some first impressions after less than a day’s use:

I like

  • That I didn’t need to charge it for 12 hours or anything silly like that. It works straight out of the box.
  • The screen – which really does need to be seen to be believed.
  • The size and weight. Small enough to hold and use in one hand, but not so small that it compromises the screen size.
  • That all of my O’Reilly DRM free eBooks work so well; and the Cory Doctorow stuff.
  • The hidden games (shift alt M).

I don’t like

  • Having to use the Sym button and sub menu to put in numbers. Why doesn’t the guide just tell you that alt+top row gives you 1-0?
  • That the AudioBook player can’t sync with my iPod (but I saw that one coming). Worse still it only offers +/- chapter or +/- 30s for navigation. That’s going to be VERY annoying if I listen to half a chapter on something else.
  • It sometimes seems to lock up if I press the back page button repeatedly.
  • That I can’t eject it when connected by USB to my Windows 7 Tablet, so it can’t be charged and used at the same time (apparently this isn’t a problem with XP machines). Update 1 Oct – The Kindle can be charged and used at the same time if you eject the removable drive from Windows Explorer rather than ejecting the device from Safely Remove Hardware and Eject Devices.

The browser

I expected it to be rubbish, and I’ve not been disappointed. It will certainly do for emergency email checking and maps, but I’m not sure that I want to use Google Reader on it for much time. I’ve heard from one happy yachtsman who likes it for weather reports; let’s hope that the ‘free 3G’ holds out a little longer for him (or that at least roaming data isn’t too pricey when the dark day comes).

I’ve not tried yet

  • Reading a PDF
  • Buying an ebook from the Kindle store.

Conclusion

So far I’m pretty happy that I bought it. More to follow later once I’ve put a few miles on the clock.


This is my third post in a series looking at how federated identity has become a reality (I first looked at Twitter, and then Google).

Before we get started

I kind of liked Facebook in the early days that I used it, but frankly I never expected it to last. I thought that like the social networks before it (MySpace anyone?) it would bud, flower and die. In my view we’re now way past the point when it should have died, but the alternative just hasn’t appeared on the scene. For me the madness peaked with FarmVille, and since then I’ve repeatedly considered Facebook suicide, especially as each new abuse of privacy has materialised. So… I’m not a fan, but still a (grudging) user of the ecosystem. Consider me a hostile witness.

The user experience

Much like Google and Twitter the initial contact with the user is a sign in with Facebook button. Assuming that they’re already signed into Facebook in another tab on their browser this should get them straight in[1].

The next bit is where it all usually goes wrong for me (and I either use another identity system, choose the username/password option or give up entirely). This is where the site that you’re connecting to tells you what it’s going to do with/to your Facebook account:

So some app that I’ve never used before wants to access a whole bunch of my data (at any time), and post as me, just so that I don’t have to remember another password. Not a fair trade. I’ve written about similar issues with the Google Apps Marketplace, but there’s a desperate need here for finer grained control.

Maintenance

You can review the apps using your Facebook ID by going to Account > Application Settings.

Under the hood

The original Facebook Connect has become deprecated in favour of OAuth, but developers still need to interact with the proprietary Graph API rather than something more open/standard such as OpenID. This recent Hacker News thread explores the pros and cons of this in some detail[2].
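For a feel of where the permission list comes from, here is a sketch of the authorization request at the start of an OAuth 2.0 flow, using the parameter names from the OAuth 2.0 specification. The endpoint, client ID and scope names are placeholders rather than Facebook’s real values, and individual providers differ in details such as how scopes are delimited.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_authorize_url(endpoint, client_id, redirect_uri, scopes, state):
    """Build an OAuth 2.0 authorization request URL."""
    params = {
        "response_type": "code",     # ask for an authorization code
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),   # space-delimited list of requested permissions
        "state": state,              # opaque value echoed back to guard against CSRF
    }
    return endpoint + "?" + urlencode(params)

if __name__ == "__main__":
    # Placeholder values throughout; no real provider or app is implied
    url = build_authorize_url(
        "https://provider.example/oauth/authorize",
        "my-app-id",
        "https://app.example/callback",
        ["email"],
        "random-state-token",
    )
    print(url)
    print(parse_qs(urlparse(url).query)["scope"])
```

The scope parameter is where an app declares what it wants, so a site that only needs to log you in could in principle ask for far less than the ‘everything, at any time’ lists that put me off.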

Persona

People are only supposed to have one Facebook account, and I’ve heard the Facebook team talk through the processes that they use to seek and destroy alternate personae. So Facebook doesn’t just not support persona – it actively discourages it. There’s no strong authentication either.

Overall

I’m not a fan, but I can see how people can get sucked into using it. It’s good in terms of not having to remember another set of credentials, but it’s bad in terms of all the (potentially) bad things that Facebook and its partners are doing with personal data. Hopefully it doesn’t discredit the whole concept of federated ID for consumers[3].

Next instalment… the rest.

Footnotes

[1] I mostly consume Facebook via Tweetdeck these days, and I must say that I find it very annoying to have to sign into Facebook just to read links that my friends have posted as I typically don’t have a Facebook tab open in Chrome.
[2] I was particularly amused by the suggestion that users might be obliged to pay a cash fee to register if they weren’t engaging in the ‘social contract’ implied by using Facebook.
[3] It seems to be popular, but still behind Google – Google Winning Sign-In War, But Facebook Close Behind.