Converting docx Files

April 16th, 2008

I’m working on an OOXML implementation in Python and found this handy utility for converting docx files to rtf.

Docx2Rtf

It seems to open docx files that Word complains about, but at least it lets me know that I am on the right track. Also, it runs under Wine on Linux, so there is no need for a virtual or non-virtual machine running Windows.
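For a quick sanity check that needs neither Word nor Wine, it helps to remember that a .docx file is a ZIP package of XML parts. Here is a minimal Python 3 sketch that checks a package for two parts a word-processing document conventionally contains and confirms that the main part is well-formed XML. The function name check_docx and the choice of parts to check are assumptions made for illustration; this is nowhere near a full OOXML validation.

```python
import zipfile
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Parts a word-processing package conventionally contains (an assumption
# of this sketch; the real main part is located via the relationships).
EXPECTED_PARTS = ("[Content_Types].xml", "word/document.xml")

def check_docx(source):
    """Return a list of problems found in a .docx package (empty if none).

    `source` is a file path or a binary file-like object.
    """
    try:
        package = zipfile.ZipFile(source)
    except zipfile.BadZipFile:
        return ["not a ZIP archive"]
    problems = []
    with package:
        names = set(package.namelist())
        for part in EXPECTED_PARTS:
            if part not in names:
                problems.append("missing part: " + part)
        if "word/document.xml" in names:
            try:
                xml.dom.minidom.parseString(package.read("word/document.xml"))
            except ExpatError as exc:
                problems.append("word/document.xml is not well-formed: %s" % exc)
    return problems
```

An empty list only means the package opened and the main part parsed; Word can still object to the content for other reasons.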

cw

Command History

April 12th, 2008

Since all the cool kids are doing it …

Work laptop

christian@yga-dowski:~$ history|awk '{a[$2]++ } END{for(i in a){print a[i] " " i}}' |sort -rn|head
82 sudo
68 vim
51 ls
49 cd
48 exit
20 hg
16 rm
16 ipython
14 py.test
10 ping

Apparently I do a lot of exiting. I just started using Mercurial for local revision control, hence the presence of hg.

Dev server

(where I actually do most of my work)

cmw@watson:~/svn/g2wc$ history|awk '{a[$2]++ } END{for(i in a){print a[i] " " i}}' |sort -rn|head
158 vim
81 make
70 svn
55 rm
34 ls
33 cd
15 sudo
11 exit
7 kinit
7 htop

The make commands encapsulate many calls to python and py.test.
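The awk one-liner counts the second field of each history line (the command name) and sorts the counts in descending order. For anyone who would rather read that in Python, here is a rough Python 3 equivalent using collections.Counter. The function name top_commands is made up for this sketch, and it assumes the default bash history format of an entry number followed by the command line.

```python
from collections import Counter

def top_commands(history_lines, limit=10):
    """Count the command name on each line of `history` output.

    Returns up to `limit` (command, count) pairs, most frequent first.
    """
    counts = Counter()
    for line in history_lines:
        fields = line.split()
        # fields[0] is the history entry number, fields[1] the command name.
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts.most_common(limit)
```

Ties may come out in a different order than with sort -rn, since the shell pipeline sorts the whole "count command" line while Counter keeps first-seen order.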

Reading Chunked HTTP/1.1 Responses

April 2nd, 2008

For work today I wanted a way to iterate over an HTTP response with chunked transfer-coding on a chunk-for-chunk basis. I didn’t see a built-in way to do that with httplib. It supports chunked reads, but you have to specify the amount that you want to read if you don’t want it to buffer the whole response. I just wanted it to read and yield each chunk that it received from the server.

For my first crack at it I really just tried to use the httplib basics:

import httplib
 
conn = httplib.HTTPConnection('localhost:8080')
conn.request('GET', '/')
r = conn.getresponse()
data = r.read(10)
while data:
    print data
    data = r.read(10)

That worked, but since I won’t know the chunk size in real life, I would probably get output similar to this:

Chunk 0
Ch
unk 1
Chun
k 2
Chunk
3
Chunk 4
...
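The splitting is easy to reproduce without a server. The sketch below is Python 3 with an in-memory stream standing in for the HTTP response, so nothing in it is httplib-specific. It reads a body made of five 8-byte messages 10 bytes at a time. The helper name read_fixed is invented for the example.

```python
import io

def read_fixed(stream, size):
    """Yield successive reads of at most `size` bytes until the stream is empty."""
    data = stream.read(size)
    while data:
        yield data
        data = stream.read(size)

# Five 8-byte messages: "Chunk 0\n" through "Chunk 4\n".
body = b"".join(b"Chunk %d\n" % i for i in range(5))
pieces = list(read_fixed(io.BytesIO(body), 10))

for piece in pieces:
    print(repr(piece))
```

Because the read size (10) does not line up with the message size (8), every read after the first starts in the middle of a message.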

I really wanted that chunk-for-chunk iteration. After taking a look at the very readable httplib source this evening, it wasn’t very hard to accomplish. I basically just took the httplib.HTTPResponse._read_chunked method and modified it to be a generator. I subclassed HTTPResponse and stuck my generator in an __iter__ method. Behold; now you can do this sort of thing:

if __name__ == "__main__":
    import httplib
    import iresponse
    conn = httplib.HTTPConnection('localhost:8080')
    conn.response_class = iresponse.IterableResponse
    conn.request('GET', '/')
    r = conn.getresponse()
    for chunk in r:
        print chunk

With nice results like this:

Chunk 0
Chunk 1
Chunk 2
Chunk 3
Chunk 4
...

You can download the iresponse module from my projects site. There is also a small CherryPy application that serves some data with chunked transfer-coding in case any of you want to fiddle with it.
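The module itself is not reproduced here. As a rough illustration of the idea, the following Python 3 sketch decodes chunked transfer-coding from any binary file-like object and yields one chunk at a time. It is neither the iresponse code nor httplib's _read_chunked. The name iter_chunks is hypothetical, and any trailer headers after the last chunk are simply read and discarded.

```python
import io

def iter_chunks(fp):
    """Yield the payload of each chunk from a chunked-encoded byte stream."""
    while True:
        size_line = fp.readline()
        if not size_line:
            raise EOFError("stream ended before the terminating chunk")
        # The size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break
        data = fp.read(size)
        if len(data) < size:
            raise EOFError("stream ended in the middle of a chunk")
        fp.readline()  # consume the CRLF that follows the chunk data
        yield data
    # Skip optional trailer headers up to the blank line that ends the body.
    while fp.readline() not in (b"\r\n", b"\n", b""):
        pass

# A hand-written chunked body containing two 7-byte chunks.
wire = b"7\r\nChunk 0\r\n7\r\nChunk 1\r\n0\r\n\r\n"
for chunk in iter_chunks(io.BytesIO(wire)):
    print(chunk.decode("ascii"))
```

Each chunk on the wire is a hexadecimal length line, that many bytes of data, and a CRLF. A zero-length chunk ends the body, which is why the generator can hand back chunks exactly as the server sent them.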

cw

My PyCon 2008 Post

March 27th, 2008

WARNING: this is YAPAP (Yet Another Post About PyCon), and a late one at that. Click to read more, else move along fair netizen.

Read the rest of this entry »

Life Rundown

February 13th, 2008

This blog has probably been pretty boring lately for those of you who aren’t interested in my technical musings. The main reason is that there has been too much junk going on for us these days. I kind of had two options:

  1. Whine about our problems.
  2. Sugar coat our problems.

Ok, so I probably had more options than that, but I think I would have gravitated toward one of those two things.

In the spirit of not doing either of those things, here is a cold, bulleted list of some of the junk from the past few months so that those of you who are interested but not in the know can be more so in the know. In no particular order:

  • Moving
  • Death in the family
  • Having two houses (and two mortgages!)
  • Broken bones
  • All of a sudden having to make our house handicap accessible for my mother-in-law
  • Our 1-year-old turning 2 and taking it seriously

Ok, there you have it. Maybe now that that is out there I can get on with some more normal life-update sort of posts. We’ll see.

cw

Code Farming

February 13th, 2008

I read a good article that gave an Organic Metaphor for software development. Here is a quote:

As I look at what I actually do in my cramped little cubicle, I realize that my work is more akin to farming than construction. I spend my days cultivating information, and growing a program. I can see the evolution of my work as it needs to scale through different platforms.

I really resonate with that description - writing software is very fluid and organic. Code is malleable and frequently changing. During development, there aren’t necessarily rigid beams and permanent fasteners like in construction.

Perhaps lending credence to both the architectural and organic software development metaphors, the article closes with this thought:

The term “System Architecture” is a more fitting in the world of packaged software than it is for the IT department or Internet development. With packaged software, a group of programmers create one major release that they will distribute to hundreds of thousands of users. Since there is no method for evolution, this type of software has much more rigid. Of course, with the Internet explosion, packaged software will take a backseat to organic programming.

Yeah. The engineering/architectural and the organic metaphors probably both have their places. Packaged, physically distributed software probably requires more engineering. The organic metaphor better describes the survival of the fittest (or survival of the first-to-market) and rapid evolution that is possible (and required) in the internet ecosystem.

cw

Ohio has been Indexed!

February 7th, 2008

Big pharma hearts Ohio

If only there was such a thing as “cloud power” - we’d tear solar power up.

cw

SimpleParse Presentation and Other ClePy Notes

February 5th, 2008

I gave a brief presentation on SimpleParse at the ClePy meeting last night.

The presentation slides and a few example files are available on my site.

Matt Wilson shared some cool stuff about IPython and twill.

IPython seems like one of those apps where you can always learn one more time-saving trick. I learned a few more last night even though I use it every day.

I’ve never used twill and I’m still not super-compelled to do so. It seems like testing with something like Selenium will pick up on any errors that twill might catch (and more since it uses a real browser). The one advantage that I am sure twill has over Selenium is speed. I guess it might fit in there between library-level unit tests and high-level tests like those using Selenium.

cw

SimpleParse Plug

December 19th, 2007

I’ve been doing more parsing stuff at work lately. For my latest project I’ve been using the SimpleParse library. It has quickly overtaken PLY as my Python parsing library of choice.

Here’s a simple calculator example using SimpleParse. It does basic arithmetic and allows you to store values in single letter variable names. It basically just validates the line that you enter and then either evals or execs it as Python code.

#!/usr/bin/env python
from simpleparse.parser import Parser
from simpleparse.common import numbers
from simpleparse.error import ParserSyntaxError
 
grammar = '''
command     := !, assign/expr
assign      := varname, ts, '=', !, ts, expr
expr        := (lpar?, 
                ts, operand, ts, (op, !, ts, operand, ts)*, ts, 
               rpar?)+
lpar        := '('
rpar        := ')'
operand     := lpar?, ts, number/varname, ts, rpar?
varname     := [a-z]
op          := [+-*/]
<ts>        := [ \t]*
'''
 
class Calculator(object):
    def __init__(self):
        self.scanner = Parser(grammar, 'command')
 
    def parse(self, command):
        try:
            success, subtags, nextchar = self.scanner.parse(command)
        except ParserSyntaxError, e:
            return str(e)
        else:
            if subtags[0][0] == 'expr':
                try:
                    return eval(command)
                except Exception, e:
                    print "Error:", e.args[0]
            elif subtags[0][0] == 'assign':
                exec command in globals()
 
if __name__ == '__main__':
    calc = Calculator()
    def prompt():
        return raw_input('> ')
 
    command = prompt()
    while command != "quit":
        result = calc.parse(command)
        if result:
            print result
        command = prompt()

There it is, in all of its uncommented glory.

Here’s an example session with the calculator:

> 2 + 2
4
> x = 4
> 8 * x
32
> y = 8 * x
> y
32
> 5 +
ParserSyntaxError: Failed parsing production "expr" @pos 3 (~line 1:3).
Expected syntax: operand
Got text: ''
> import shutil
Error: invalid syntax
> quit

Here are some of my favorite things about SimpleParse.

The entire grammar is declared in EBNF.
This is a breath of fresh air coming from PLY, where the grammar rules are scattered amongst the action-code (in docstrings).

It has a clean API.
The API is very succinct, and thus you get to know it fairly quickly. Creating Processors for parsed text is quite straightforward.

It’s Fast.
It’s built on the fastfastfast mx.TextTools library from eGenix. It converts your EBNF grammar to mx.TextTools tag tables and pushes the heavy lifting off to the mx.TextTools tagging engine.

Some stuff that I found odd includes…

No Simple Tutorial.
Maybe I’ll do something about that using the calculator example above. The library is really great, but there is a fairly steep learning curve at this point due to the lack of a basic tutorial.

No Separate Lexing Stage.
Unlike PLY and other traditional parser generators, you don’t generate a token stream and then parse that according to your grammar. SimpleParse generates an mx.TextTools result tree, which is then processed (typically) by a SimpleParse DispatchProcessor subclass.
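For contrast, a separate lexing stage looks roughly like this. The sketch is standard-library Python 3 rather than PLY or SimpleParse, and the token names and the tokenize function are invented for the example. It turns a calculator line into a flat token stream that a parser would then consume, silently dropping whitespace along the way.

```python
import re

# One named group per token type; whitespace is matched but never emitted.
TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<NAME>[a-z])
  | (?P<OP>[-+*/])
  | (?P<EQUALS>=)
  | (?P<LPAR>\()
  | (?P<RPAR>\))
  | (?P<SKIP>[ \t]+)
""", re.VERBOSE)

def tokenize(text):
    """Return a list of (type, text) pairs, or raise ValueError on bad input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(
                "unexpected character %r at position %d" % (text[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

With a stage like this in front, the grammar only ever sees tokens, so it never has to mention whitespace at all.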

Error Handling.
SimpleParse seems to like to just stop parsing when it reaches a syntax item that it can’t parse. You need to use a special token (an exclamation point, !) in your grammar to denote where you want to get picky about syntax. Although it is a bit tricky, it seems quite flexible so far.

Whitespace Handling.
There is no way (that I know of) to completely ignore certain characters. PLY let me define characters that were simply ignored, as if they weren’t even there. In the calculator above, I had to insert a ts production wherever I anticipate a potential series of tab or space characters in the input. However, by putting the production name in <angle brackets> the matched text is kept out of the result tree.

I’ll end this now by encouraging anyone interested in parsing a little (or big!) language using Python to check out SimpleParse. Its clean API and speed make it very nice to work with.

cw

Political Plug

November 29th, 2007

Here’s an article from the Economist on foreign policy and the race for the White House.

Foreign policy and the presidential race.

It’s an interesting article, but even more interesting is that the data in the poll came from the organization that I work for. Ok, maybe more interesting is a bit of a stretch. ;-)

So there you go - the fruits of my (and many others’) labor.

cw