A Stitch in Time Saves Nine

Have you ever wondered where common-sense sayings come from? They come from a place of deep wisdom, I like to think. They encode something true about the world in a form that can be passed down the generations. A bird in the hand is worth two in the bush. Has this one been handed down since we were hunter-gatherers? I would love to know more. Part of the reason I bring it up is that it's not sufficient to wax eloquent about programming and not do any actual programming.

The way we transmit information today is that we record it in research papers. Then we upload these to the internet, and for a really hefty fee (or by sailing the seas under a skull-and-crossbones banner) you have the whole measure of human knowledge at your fingertips. Awesome.

As the publishing industry tried to wring the maximum possible profit out of its journals, a sort of folk resistance sprang up, and people began to upload their pre-prints to a server called ar$\chi$iv. So far so good.

There is a wealth of information and cutting-edge research on the site, and lots of books and tutorials too. It is one of my favourite websites to download interesting stuff from. However, I ran into a huge problem. I have a Downloads folder with many papers and, for some reason that must have made sense to the website masters, they name the PDFs with alphanumeric strings like 2501.00663v1. Yikes. Don't get me wrong, I love downloading and I love to expand my horizons, and one day (if I live to be a million years old) I am going to read all of the hundreds of PDFs I have downloaded from ar$\chi$iv. But first I need to know what I am reading. Manually naming four to six hundred PDFs is not what I am getting paid for. Maybe there is some arcane magic, some profane incantation that I can chant, or some forbidden runes that I can inscribe, so that the names of the PDFs become the titles of the papers? Oh, my left eye is twitching. This surely forebodes some true evil.

But we are in tutorial heaven, my friends. We are not in Hell and, as Rorschach said, we are not trapped in here with these tutorials; these tutorials are trapped in here with us.

So the idea is pretty simple. There must be some library that reads the PDF's metadata, which should hold the title and author names, and then we can just rename the file to title_author. Sounds like an epic win for us.
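Before touching any PDF library, here is a minimal sketch of the renaming half of that idea, using only the standard library. It assumes we have already got a title and an author string from somewhere. The function names (`looks_like_arxiv_name`, `slug`, `new_name`) and the exact `Title_Author.pdf` scheme are made up for illustration, not anything a library gives us.

```python
import re
from pathlib import Path

# New-style arXiv file stems look like 2501.00663v1: YYMM, a dot,
# four or five digits, and an optional version suffix.
ARXIV_STEM = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def looks_like_arxiv_name(path):
    """True if the file (with its extension removed) is a bare arXiv-style id."""
    return ARXIV_STEM.match(Path(path).stem) is not None


def slug(text, max_len=80):
    """Replace runs of non-word characters with '_' and trim the result."""
    cleaned = re.sub(r"\W+", "_", text).strip("_")
    return cleaned[:max_len].rstrip("_")


def new_name(title, author):
    """Build 'Title_Author.pdf' from metadata strings (hypothetical scheme)."""
    parts = [slug(p) for p in (title, author) if p and slug(p)]
    if not parts:
        return None  # nothing usable in the metadata, so keep the old name
    return "_".join(parts) + ".pdf"


if __name__ == "__main__":
    # Placeholder strings, not the metadata of any real paper.
    print(looks_like_arxiv_name("2501.00663v1.pdf"))
    print(new_name("An Example Title: With Punctuation!", "A. Author"))
```

Returning `None` when both fields are empty is a deliberate choice: plenty of PDFs carry no useful metadata, and an unrenamed file is better than one called `.pdf`.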

Of course, just for kicks, I downloaded the PDF standard, and that is like a gazillion pages long. Apparently you can run Doom in a PDF file too. Who comes up with these deranged notions? What in the name of hell is this? Luckily, we already have the people who escaped tutorial hell to thank. There are a lot of open-source PDF libraries that do all the heavy lifting for us, and we can just call their APIs from the language of our choice. We will investigate both functional and object-oriented programming for this, so I will be coding this in Python (which I sort of know) and Haskell (which I am sort of trying to learn).

Do you ever wonder what stars actually are? People who escape tutorial hell become stars in the sky and guide us across the code manifold. Imagine an abstract space full of tokens and stop sequences and Unicode fragments and monkeys on typewriters writing reserved keywords and stringing them together. You take a path in this space and if the path connects (i.e., if your construct compiles and runs), then you emerge into the light of Arda, bathed in the Goodness of Piety. Otherwise you are lost, condemned to totter from one tutorial to the next. Some of those gentle souls who escaped tutorial hell become open-source developers who write libraries and bindings and maintain all the software that we need and run. Some become greedy moguls. Why does this happen? We will talk about that at an appropriate time. But there are indeed libraries that can read metadata. But what if we wrote a library to read metadata ourselves? Pfft, how difficult could it really be? I, whose will was forged in the sea of a million unfinished tutorials, embark on this challenge. The data is right there. The metadata is in the file; you just read it in binary mode and Bob's your uncle.
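To see how far "just read it in binary mode" gets us, here is a deliberately naive sketch. It assumes the title sits in the file as a plain `/Title (...)` literal string in an uncompressed dictionary. That is often not the case: the entry can live inside a compressed object stream, be written as a hex or UTF-16 string, or be missing altogether. The names `naive_title` and `naive_title_from_file` are made up for this post.

```python
import re

# Matches /Title (literal string). Backslash-escaped characters are allowed
# inside the string, but unescaped nested parentheses are not handled.
TITLE_RE = re.compile(rb"/Title\s*\(((?:\\.|[^\\()])*)\)", re.DOTALL)


def naive_title(pdf_bytes):
    """Return the first literal-string /Title found in raw PDF bytes, or None."""
    match = TITLE_RE.search(pdf_bytes)
    if match is None:
        return None
    raw = match.group(1)
    # Undo only the simplest escapes: \( \) and \\
    raw = re.sub(rb"\\([()\\])", rb"\1", raw)
    # Assumption: treat the bytes as Latin-1, which is only roughly right
    # for PDF text strings and wrong for UTF-16 encoded ones.
    return raw.decode("latin-1").strip()


def naive_title_from_file(path):
    """Read the whole file in binary mode and scan it for a title."""
    with open(path, "rb") as handle:
        return naive_title(handle.read())


if __name__ == "__main__":
    # A hand-written fragment, not a real PDF.
    sample = b"1 0 obj\n<< /Title (A \\(Toy\\) Example) /Author (Nobody) >>\nendobj\n"
    print(naive_title(sample))
```

If this returns `None` on most of a Downloads folder, that is the format telling us why people reach for a real PDF library.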

There is a very nice open-source library called QPDF, released under the Apache License (a permissive license), that we can use for this purpose.