Timeline

I love to scroll through old photos and reminisce about old times. I took solace in those memories during the months of lockdown, when it became harder to forge new ones.

But there aren't always photos. Sometimes I just want to enjoy the moment, without the distraction of a camera. I keep travel diaries to plug the gaps and cover what can't be photographed. They don't just show what I was looking at, but also what I was thinking about and doing at the time.

Then there are the tweets, blog posts, searches, doodles, movements and conversations. Together, they tell a much richer story than a few landscape pictures.

I wanted to bring it all together into an all-encompassing personal diary. It didn't exist, so I built it.

Okay, but what does it really do?

It collects all your crap, and it puts it on a timeline.

But... why?

I thought it would be neat.

“Why the hell would we want to [launch an anvil 100 feet in the air]? I get that a lot from women y'know. Women say why would you want to do that? and I don't know other than it just need launchin' sumpin' that wadn't intended to be launched.”

- World champion anvil shooter Gay Wilkinson

This project had been marinating in my mind for a while. I always wanted to bring my travel diaries, photos and location history together on a single timeline. It would be fun to pick a random date and see what I was up to, what (or who) I was into, and how I felt about it all.

In a separate problem space, I needed a better way to organise my computer, phone and server backups. I wanted to collect all the data in one place, then make copies elsewhere. Since this timeline thing feeds off the same data, I thought I could kill two birds with one stone.

Or, to reuse available hardware, a motorcycle with panniers full of SSDs.

How does it work?

This is better explained in the project's README.

In a nutshell, the server processes data from Sources (for regular imports) and Archives (for one-off imports), and turns them into Entries (things on the timeline). It pulls data from my laptop, my phone, social media, RSS feeds, GDPR data exports, bank transaction logs, etc. In some cases, the data is pushed from another system, like the geolocation recorder I wrote for it.
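
To make the vocabulary concrete, here is a minimal sketch of that data model as Django models. The model and field names are my own illustration, not the project's actual schema.

    # A minimal sketch of the Sources / Archives / Entries model, assuming Django.
    # Names and fields are illustrative, not the project's actual schema.
    from django.db import models

    class Source(models.Model):
        """A recurring data source, polled on a schedule (RSS feed, phone backup...)."""
        name = models.CharField(max_length=100)
        key = models.SlugField(unique=True)

    class Archive(models.Model):
        """A one-off import, e.g. a Google Takeout or a bank export."""
        name = models.CharField(max_length=100)
        uploaded_file = models.FileField(upload_to='archives/')

    class Entry(models.Model):
        """A single thing on the timeline: a photo, a message, a commit..."""
        schema = models.CharField(max_length=100)    # e.g. 'file.image', 'message.sms'
        source = models.CharField(max_length=100)    # which Source or Archive produced it
        title = models.TextField(blank=True)
        description = models.TextField(blank=True)
        extra_attributes = models.JSONField(default=dict)
        date_on_timeline = models.DateTimeField(db_index=True)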

After all that data is converted into timeline Entries, it shows up in the web interface. Some Entries are displayed as thumbnails, others as posts, or as points on a map. The web interface gets a bunch of entries, and it decides how to best display them.

The tech stack is fairly conservative. It's a VueJS frontend, and a Django backend. It runs inside Docker on Ubuntu on a 10-year-old ThinkPad in a home-made rack under my desk (it's not full stack if you don't make your own rack!). It's the same ThinkPad that runs Nickflix.

I spared no expenses

Where does the data come from?

The most unsettling aspect of this project is the provenance of the data. Most of the half-million entries on the timeline come from data Google and Facebook already had on me. That data comes from GDPR exports, and those are not even complete.

For example, Google had 8 years of geolocation history - over 50,000 data points in total - and years of emails, search queries, browsing history, etc. Facebook and Telegram had tens of thousands of messages. That's on top of the content I willingly share on social media: tweets, status updates, comments, etc.

As I was exporting that data, I ended up closing many accounts, and scrubbing my personal data from public profiles.

Fortunately, there is also data I actually control: geolocation data from my photos, GPX tracks, posts from my blog, SMS message dumps, git commits, etc. That data is a lot easier to import, because it's not locked behind clunky APIs and data export tools. If I need a better way to import data, I can just create one: for example, by adding RSS feeds to my websites and letting the timeline consume them.
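
As an illustration, consuming one of those RSS feeds might look something like this. It assumes the feedparser library and a simplified entry shape; the schema name is made up for the example.

    # A rough sketch of turning an RSS feed into timeline entries, assuming
    # the feedparser library. The 'schema' value and entry shape are made up
    # for this example.
    from datetime import datetime, timezone

    import feedparser

    def import_rss_entries(feed_url: str) -> list:
        entries = []
        for item in feedparser.parse(feed_url).entries:
            published = datetime(*item.published_parsed[:6], tzinfo=timezone.utc)
            entries.append({
                'schema': 'social.blog.post',
                'title': item.get('title', ''),
                'description': item.get('summary', ''),
                'date_on_timeline': published.isoformat(),
                'extra_attributes': {'url': item.get('link')},
            })
        return entries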

Challenges

Getting the data

Accessing the data is often the biggest challenge. I want to make the data imports automatic and repeatable, and only rely on manual imports when I have no other choice. That usually isn't possible. When a service becomes large enough, it restricts access to its APIs, and holds your data hostage. You can only access it through their app, on their terms.

Take Google Photos for example. Since June 2019, it no longer syncs with Google Drive. You can only see your own photos through their app, and getting them onto a computer you own has become really difficult. Google Takeout still lets you export all your photos, but it's a rather clunky, manual process. Twitter makes it hard to get API credentials nowadays, so the API is out of reach for most. Fortunately, I had old credentials I could reuse. Reddit's API only returns the last 1000 comments, so it's missing about a decade's worth of comments. Many websites just don't offer any API at all.
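
To give an idea of what working with one of these APIs looks like, here is a hedged sketch of pulling a Reddit comment history with the praw library. The credentials and username are placeholders, and the listing simply stops at the roughly 1000-comment cap mentioned above.

    # A sketch of pulling a Reddit comment history with praw. Credentials and
    # the username are placeholders; the listing silently stops around 1000
    # items, because that's all the API will return.
    import praw

    reddit = praw.Reddit(
        client_id='YOUR_CLIENT_ID',
        client_secret='YOUR_CLIENT_SECRET',
        user_agent='timeline-importer/0.1',
    )

    comments = []
    for comment in reddit.redditor('your_username').comments.new(limit=None):
        comments.append({
            'schema': 'social.reddit.comment',
            'description': comment.body,
            'date_on_timeline': comment.created_utc,  # Unix timestamp, in UTC
            'extra_attributes': {'permalink': comment.permalink},
        })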

If it wasn't for GDPR forcing websites to give you access to your own data, this project would never have gotten off the ground.

Ironically, the EU's open banking initiative, which forces banks to open your data to third-party services (like accounting software), does not actually let you access your own banking data. There is no room in this grand scheme for people who just want their own transactions in JSON format. Intermediary services don't offer a pricing tier for individuals. Fortunately, my bank still lets me export my transactions in CSV format, or my entire data in JSON format. However, there's no way to automate the process. I have to manually export my transactions every month, and import them as an Archive. So be it.
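
For what it's worth, the monthly ritual boils down to a small conversion script along these lines. The column names are assumptions about the bank's CSV layout, not a real format.

    # A throwaway-script-style sketch: convert a bank's CSV export into JSON
    # ready to be imported as an Archive. The column names ('Date',
    # 'Description', 'Amount') are assumptions about the bank's format.
    import csv
    import json
    from datetime import datetime, timezone

    def bank_csv_to_json(csv_path: str, json_path: str) -> None:
        entries = []
        with open(csv_path, newline='', encoding='utf-8') as f:
            for row in csv.DictReader(f):
                date = datetime.strptime(row['Date'], '%Y-%m-%d').replace(tzinfo=timezone.utc)
                entries.append({
                    'schema': 'finance.transaction',
                    'title': row['Description'],
                    'date_on_timeline': date.isoformat(),
                    'extra_attributes': {'amount': row['Amount']},
                })
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(entries, f, indent=2)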

For small, one-off data imports, I convert the data to JSON with throwaway scripts, and import that. There's an Archive type for that. I imported all the SMS messages I sent or received from 2013 to 2015 this way, and hundreds of diary entries I once kept in text files and Google Keep notes. Now they all sit neatly on the same timeline.
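
One of those throwaway scripts could look like this. It assumes an SMS dump where each <sms> element carries an address, a body and a millisecond timestamp, which is only one of many possible dump formats.

    # A sketch of a throwaway SMS conversion script. It assumes an XML dump
    # where each <sms> element has 'address', 'body' and a millisecond 'date'
    # attribute; real dump formats vary.
    import json
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    def sms_xml_to_json(xml_path: str, json_path: str) -> None:
        entries = []
        for sms in ET.parse(xml_path).getroot().iter('sms'):
            date = datetime.fromtimestamp(int(sms.get('date')) / 1000, tz=timezone.utc)
            entries.append({
                'schema': 'message.sms',
                'description': sms.get('body'),
                'date_on_timeline': date.isoformat(),
                'extra_attributes': {'address': sms.get('address')},
            })
        with open(json_path, 'w', encoding='utf-8') as f:
            json.dump(entries, f, indent=2)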

Time zones

Time zones are hard. I store Unix timestamps and print UTC dates to remove some headaches, but with some of the data, it's not that simple.

For example, EXIF metadata has no time zone information; it just records whatever date and time the camera was set to. I could assume that the camera uses whichever time zone I live in, but that's clunky and rather inaccurate.
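
To make the problem concrete, the "assume my home time zone" shortcut amounts to something like this. The hard-coded zone is an arbitrary example, not what the project actually does.

    # A minimal illustration of the shortcut: EXIF dates are naive
    # ('YYYY:MM:DD HH:MM:SS', no zone), so pin them to an assumed home time
    # zone. The zone below is an arbitrary example.
    from datetime import datetime
    from zoneinfo import ZoneInfo

    HOME_TZ = ZoneInfo('Europe/Berlin')  # assumption: the camera travelled with me

    def exif_date_to_utc(exif_value: str) -> datetime:
        """'2020:03:22 14:05:00' -> aware UTC datetime, give or take a time zone."""
        naive = datetime.strptime(exif_value, '%Y:%m:%d %H:%M:%S')
        return naive.replace(tzinfo=HOME_TZ).astimezone(ZoneInfo('UTC'))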

That's also an issue when displaying the data on the timeline. If I'm in Germany, everything I did in Canada is off by 6 hours, and everything I did in Kazakhstan is off by 5 hours in the other direction. I could use the surrounding geolocation entries to infer the time zone of each entry, but that's getting a tad complex. I decided to just close my eyes and pretend the problem isn't there.

Dates

It wasn't any easier to figure out when things happened, and where they should appear on the timeline.

A photo could appear in multiple places: on the date it was taken (if it's known), on the date the file was created (if it's known), and on every date the file was modified. All of those dates are inaccurate, if they're available at all.

Currently, I look at 3 things, in order of priority (a rough sketch of this logic follows the list):

  1. The date written in the file name. For example, selfie - 2020-03-22.jpg. This is useful for notebook scans and other documents, since the file's creation date is not the physical document's creation date.
  2. The EXIF date. This is a semi-reliable indicator of when a photo was taken. It has no time zone information, and can be way off if the camera's clock isn't set properly, but it works really well for smartphone pictures.
  3. The file modification date. I can't use the file creation date, because it's not available on many filesystems.
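
In code, that priority order could be expressed roughly like this. The file name pattern and the EXIF handling are simplified; this is an illustration rather than the project's exact implementation.

    # A rough sketch of the three-step date lookup described above.
    import os
    import re
    from datetime import datetime, timezone
    from typing import Optional

    FILENAME_DATE = re.compile(r'(\d{4}-\d{2}-\d{2})')

    def guess_file_date(path: str, exif_date: Optional[datetime] = None) -> datetime:
        # 1. A date written in the file name, e.g. "selfie - 2020-03-22.jpg"
        match = FILENAME_DATE.search(os.path.basename(path))
        if match:
            return datetime.strptime(match.group(1), '%Y-%m-%d').replace(tzinfo=timezone.utc)
        # 2. The EXIF date, if one could be extracted
        if exif_date:
            return exif_date
        # 3. Fall back to the file modification date
        return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)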

Data formats

Parsing hundreds of thousands of files from dozens of different devices led to all sorts of interesting problems.

I had to remember to include .jpeg and .mpeg files, not just .jpg and .mpg. I also had to include many other file extensions I completely forgot about: .m4v, .mod, .mts, .hevc, .3gp, .raf, .orf and so on. Thankfully, ImageMagick and ffmpeg could make thumbnails out of those without breaking a sweat.
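
A thin wrapper around those two tools might look like this. The command-line flags are standard ImageMagick and ffmpeg options, but the wrapper itself is illustrative, not the project's actual code.

    # An illustrative wrapper around the two tools mentioned above.
    import subprocess

    VIDEO_EXTENSIONS = {'.mpg', '.mpeg', '.m4v', '.mod', '.mts', '.3gp', '.mp4'}

    def make_thumbnail(source: str, destination: str, size: int = 400) -> None:
        if any(source.lower().endswith(ext) for ext in VIDEO_EXTENSIONS):
            # Grab one representative frame and scale it down.
            cmd = ['ffmpeg', '-y', '-i', source, '-vf',
                   f'thumbnail,scale={size}:-1', '-frames:v', '1', destination]
        else:
            # Let ImageMagick handle the zoo of image formats.
            cmd = ['convert', source, '-thumbnail', f'{size}x{size}', destination]
        subprocess.run(cmd, check=True, capture_output=True)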

EXIF data gave me so much trouble that it's worth its own post.

Large files

Archives can be very large, and sometimes multipart. When playing with large files, you always encounter weird new issues.

I had to configure multiple application layers to allow large file uploads, and long request/response times, so that I could upload 5+ GB Google Takeout archives. I also had to clear space on my laptop just to test Google Photos backup imports. Even then, I still run into issues when uploading larger archives. The solution might be to implement multipart uploads, but that sounds like a pain in the lower back.
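
The Django side of those knobs looks roughly like this; the values are examples, and the reverse proxy needs matching limits of its own (nginx's client_max_body_size and proxy_read_timeout, for instance).

    # Illustrative Django settings for accepting multi-gigabyte uploads. The
    # values are examples, not the project's actual configuration.
    FILE_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024   # 10 MB in memory, then spill to disk
    DATA_UPLOAD_MAX_MEMORY_SIZE = 10 * 1024 * 1024   # cap on non-file request bodies
    FILE_UPLOAD_TEMP_DIR = '/data/tmp/uploads'       # somewhere with room for 5+ GB files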

Importing large amounts of data through the browser is also troublesome. If you paste 5 MB of JSON in a text field, your browser will freeze, then crash.

When processing large files, you have to buffer the data; if you load everything into memory at once, you quickly run out. You also have to be frugal with CPU usage, because everything runs on old hardware alongside other projects. I pulled many interesting tricks to speed up the backups. It's pretty fun to tackle those little challenges along the way.
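
The buffering part is nothing exotic. As a minimal example (not taken from the project), here is how you can checksum a multi-gigabyte file without ever holding more than one chunk in memory.

    # Hash a huge file one chunk at a time instead of reading it whole.
    import hashlib

    def checksum(path: str, chunk_size: int = 1024 * 1024) -> str:
        digest = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                digest.update(chunk)
        return digest.hexdigest()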

Zip files

Zip files can encode their file names in two ways: CP437 or Unicode.

Each operating system does it wrong, but in a different way. For instance, Mac OS encodes its zip file names as Unicode, but doesn't set the language encoding flag (bit 11) correctly, so Python (correctly) reads them as CP437, and garbles the non-ASCII characters in file names.

I wrote a quick and dirty workaround for Mac OS archives: if the file doesn't exist, encode the name as CP437 and check again. I'll think of something more clever if I ever switch to another OS.
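
The gist of the trick, as a sketch rather than the project's exact code: when bit 11 is unset, Python hands you a CP437-decoded name, so encoding it back to CP437 bytes and trying UTF-8 usually recovers the intended name.

    # A sketch of the CP437 round-trip: Python decoded the name as CP437
    # because bit 11 was unset, so turn it back into bytes and try UTF-8.
    import zipfile

    def real_names(archive_path: str) -> list:
        names = []
        with zipfile.ZipFile(archive_path) as archive:
            for info in archive.infolist():
                name = info.filename
                if not info.flag_bits & 0x800:  # bit 11 unset: Python used CP437
                    try:
                        name = name.encode('cp437').decode('utf-8')
                    except UnicodeDecodeError:
                        pass  # the name really was CP437, leave it alone
                names.append(name)
        return names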

Noise

My home server sits in my open living room. I don't want to hear the fans spin needlessly when I'm reading on the couch. This led to performance tweaks I would have otherwise overlooked, like this one. This CPU-intensive task used to take 3 minutes, and now it takes 20 seconds. It finishes before the fans spin up.