Thursday, May 08, 2003
I have two huge directories on my computer, both filled with hundreds and hundreds of physics papers. I tried to make this less daunting by keeping track of the papers in a formatted text file, with authors, titles, annotations, etc. -- but that's why there are two directories, 'old' and 'new,' and the majority of papers, in 'new,' haven't made it into the text file yet. Remember, hundreds. It's very daunting. And the papers are in many different formats. It got so bad that I couldn't find anything. Not only that, but I was terrified to even try. Something needed doing.
I tried a bunch of bibliography management tools, and the one most up to the task was of course an open-source program called BibDesk, which keeps track of papers in a BibTeX file and does lots of neat stuff like fast sorting by keywords, paper preview, presenting annotations in a frame below the list, etc. Really cool. However, there still remained the task of making an entry for each of the hundreds and hundreds of files. Eeek!
I sometimes do these things by hand, but this seemed ridiculous. And I've been wanting to learn Python, a fairly new scripting language. So, yesterday, I learned Python. Wow. It is the coolest language ever. I forgot how easy things could be in an interpreted language, where you can just try the code out on a command line as you go to see if it's doing what you want. And it almost always seemed to be doing exactly what I wanted. Everything I tried that I thought might just work, worked -- very unlike so many other languages.
Within two hours of starting to play with Python, I had written a script that would take a PDF paper from the 'new' directory, convert some of it to text, grab the paper's identifier number, use that number to grab the paper's abstract page off the arXiv, parse that page to get the title, author, abstract, and number of pages if available, and append this info as a new BibTeX entry in BibDesk's data file. Heh. "Hello World" indeed! I'm not kidding, two hours. Everything just worked. After that, it was a piece of cake to write another script to parse my monstrous text file and convert the entries to BibTeX. And then another script to use those BibTeX entries to grab the actual paper PDFs from the arXiv, so the papers would all be in one nice format instead of variously compressed PostScript files -- that script, "arxivsuck," is happily running now.
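The original script isn't reproduced here, but two of the steps it describes (pulling the identifier out of the converted text, and writing a BibTeX entry) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual code: the function names, the regular expression, the choice of BibTeX fields, and the citation-key scheme are all made up for the example, and the PDF-to-text and abstract-page parsing steps are left out entirely.

```python
import re

# Assumption: papers carry an old-style arXiv identifier such as
# "hep-th/0000000" (archive name, slash, seven digits).
ARXIV_ID = re.compile(r"\b([a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7})\b")


def find_arxiv_id(text):
    """Return the first old-style arXiv identifier found in text, or None."""
    match = ARXIV_ID.search(text)
    return match.group(1) if match else None


def abstract_url(arxiv_id):
    """Build the address of the paper's abstract page on arxiv.org."""
    return "https://arxiv.org/abs/" + arxiv_id


def bibtex_entry(arxiv_id, title, authors, abstract="", num_pages=None):
    """Format one BibTeX entry as a string.

    The field layout is an illustrative guess, not BibDesk's own format.
    """
    key = arxiv_id.replace("/", "_")  # hypothetical citation-key scheme
    fields = [
        ("author", " and ".join(authors)),
        ("title", title),
        ("eprint", arxiv_id),
        ("url", abstract_url(arxiv_id)),
    ]
    if num_pages is not None:
        fields.append(("note", "%d pages" % num_pages))
    if abstract:
        fields.append(("abstract", abstract))
    body = ",\n".join("  %s = {%s}" % (name, value) for name, value in fields)
    return "@article{%s,\n%s\n}\n" % (key, body)
```

With a placeholder identifier, `find_arxiv_id("arXiv:hep-th/0000000 v1")` picks out `hep-th/0000000`, and the string returned by `bibtex_entry(...)` could then be appended to the .bib file with an ordinary `open(path, "a")`.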