Timeline

I love to scroll through old photos and reminisce about old times. However, a camera doesn't capture everything. It's not always at hand, nor is it always appropriate to use. There aren't always photos. This is why I keep a diary, and also why I keep the notebooks in which I scribble day-to-day notes. They offer a deeper insight into my daily life.

I wanted to extend my photo timeline and turn it into a sort of enhanced diary, one that includes notes, blog articles, social media posts, browsing history, doodles, geolocation, conversations and more.

Okay, but what does it really do?

It collects all your crap, and it puts it on a timeline.

But... why?

I thought it would be neat.

“Why the hell would we want to [launch an anvil 100 feet in the air]? I get that a lot from women y'know. Women say why would you want to do that? and I don't know other than it just need launchin' sumpin' that wadn't intended to be launched.”

- World champion anvil shooter Gay Wilkinson

I always wanted to bring my travel diaries, photos and location history together on a single timeline. It's fun to pick a random date and see what I was up to, what (or who) I was into, and how I felt about it all.

For example, after my motorcycle crash in Tajikistan, I can see my mood swing through the week. A day after the crash, I'm angrily Googling my feelings about drivers in Tajikistan, but as the hours pass, the queries become more focused: how much do motorcycle forks cost? Where is the next garage? How far is Bishkek? I'm coordinating with the Suzuki dealership in Berlin. Then pictures of landscapes appear, and I lighten up. My diary talks of mountains and fellow travellers. The GPS logs put me in the small village where I met a kind Finnish couple. A few days later, there are photos (and a magazine article) of us riding together along the Chinese border. I have a big grin on my face.

It's fascinating to see memories so clearly rendered, and to flip through them so effortlessly.

In a separate problem space, I needed a better way to organise backups of my computer, phone and servers. I hoped to collect all the data in one place, then make copies elsewhere. Since this timeline thing feeds off the same data, I thought I could kill two birds with one stone.

Or, to reuse available hardware, kill two birds with a motorcycle whose panniers are full of SSDs.

How does it work?

This is better explained in the project's README.

In a nutshell, the backend processes data from Sources (for regular imports) and Archives (for one-off imports), and turns them into Entries (things on the timeline). It pulls data from my laptop, my phone, social media, RSS feeds, GDPR data exports, bank transaction logs etc. In some cases, the data is pushed from another system, like the geolocation recorder I wrote for it.

After all that data is converted into timeline Entries, it appears on the frontend. Some Entries are displayed as thumbnails, others as posts, or as points on a map. It's all up to the frontend.
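To make that pipeline a bit more concrete, here's a rough sketch of what an Entry could carry. The field names are my own guesses for illustration, not the project's actual data model.

```python
# Illustrative sketch of a timeline Entry; field names are guesses.
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Entry:
    """A single thing on the timeline: a photo, a post, a GPS point..."""
    schema: str                # e.g. "image", "message.sms", "activity.location"
    source: str                # which Source or Archive produced it
    date_on_timeline: int      # Unix timestamp used to position the Entry
    title: str = ""
    description: str = ""
    extra_attributes: dict[str, Any] = field(default_factory=dict)  # thumbnails, coordinates...
```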

The tech stack is fairly conservative. It's a VueJS frontend, and a Django backend. It runs inside Docker on Ubuntu on a 10-year-old ThinkPad in a home-made rack under my desk (it's not full stack if you don't make your own rack!). It's the same ThinkPad that runs Nickflix.

I spared no expense

Where does the data come from?

The most unsettling aspect of this project is the provenance of the data. Most of the half-million entries on the timeline come from data Google and Facebook already had on me. The data comes from GDPR exports, and those are not even complete.

For example, Google had 8 years of geolocation history - over 50,000 data points in total - and years of emails, search queries, browsing history etc. Facebook and Telegram had tens of thousands of messages. That's on top of the content I willingly share on social media: tweets, status updates, comments, etc.

As I was exporting that data, I ended up closing many accounts, and scrubbing my personal data from public profiles.

Fortunately, there is also data I actually control: geolocation data from my photos, GPX tracks, posts from my blog, SMS message dumps, git commits, etc. That data is a lot easier to import, because it's not locked behind clunky APIs and data export tools. If I needed a better way to import data, I could just create it. For example, I added RSS feeds to my websites.
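As an illustration of how friendly that data is, here's a minimal sketch that pulls timestamped coordinates out of a GPX track with nothing but the standard library. The file name is made up.

```python
# Minimal GPX reader: yields timestamped coordinates from a track file.
import xml.etree.ElementTree as ET

GPX_NS = {"gpx": "http://www.topografix.com/GPX/1/1"}

def gpx_points(path):
    """Yield (iso_timestamp, latitude, longitude) tuples from a GPX track."""
    root = ET.parse(path).getroot()
    for point in root.iterfind(".//gpx:trkpt", GPX_NS):
        time = point.findtext("gpx:time", namespaces=GPX_NS)
        yield time, float(point.get("lat")), float(point.get("lon"))

for timestamp, lat, lon in gpx_points("tajikistan-2019.gpx"):
    print(timestamp, lat, lon)
```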

Challenges

Getting the data

Obtaining the data was often the biggest challenge. I wanted to make the data imports automatic and repeatable, and only rely on manual imports when strictly necessary. That usually wasn't possible. When a service becomes large enough, it restricts access to its APIs, and holds your data hostage.

Since June 2019, Google Photos no longer syncs with Google Drive. It became really difficult to get photos off my phone and onto my computer. Google Takeout still lets you export all your photos, but it's a rather clunky, manual process. Twitter makes it hard to get API credentials nowadays, but I had old ones I could reuse. Reddit's API only returns the last 1,000 comments, so it's missing about a decade's worth of comments. Many websites just didn't offer any API at all.

If it weren't for GDPR forcing these websites to give you access to your own data, this project would never have gotten off the ground.

Ironically, the EU's open banking initiative, which forces banks to open your data to third-party services (like accounting software), does not actually allow you to access your own banking data. There is no room in this grand scheme for people who just want their own transactions in JSON format. Intermediary services don't offer a pricing tier for individuals. Fortunately, my bank still lets me export my transactions in CSV format, or my entire data in JSON format. However, there's no way to automate the process. I have to manually export my transactions every month, and import them as an Archive.

For small, one-off data imports, I converted the data to JSON with throwaway scripts, and imported that through a dedicated Archive. For example, I had a backup of all the SMS messages I sent or received from 2013 to 2015, and hundreds of diary entries in text files and Google Keep notes. I converted those to JSON Entries and imported them.
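A throwaway script in that spirit might look like the sketch below. It assumes an SMS dump in a common XML format with <sms> elements carrying date, address and body attributes, and the entry shape is illustrative rather than the importer's exact format.

```python
# Throwaway conversion: XML SMS dump -> JSON entries for an Archive import.
# Assumes <sms date="epoch ms" address="..." body="..."/> elements.
import json
import xml.etree.ElementTree as ET

def sms_to_entries(xml_path):
    """Convert an XML SMS dump into a list of JSON-ready entry dicts."""
    entries = []
    for sms in ET.parse(xml_path).getroot().iter("sms"):
        entries.append({
            "schema": "message.sms",
            "date_on_timeline": int(sms.get("date")) // 1000,  # epoch ms -> seconds
            "title": sms.get("address"),
            "description": sms.get("body"),
        })
    return entries

with open("sms-entries.json", "w") as f:
    json.dump(sms_to_entries("sms-backup-2013-2015.xml"), f, indent=2)
```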

Time zones

Time zones are hard. I store Unix timestamps and print UTC dates to remove some headaches, but with some of the data, it's not that simple.

For example, EXIF metadata does not have time zone information. It just uses the camera's local date. I could assume that the camera is set to whichever time zone I live in, but that's clunky and rather inaccurate.
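Written down, that clunky assumption looks something like this sketch: parse the naive EXIF date, pin it to a hard-coded home zone, and store a Unix timestamp. Europe/Berlin is my illustrative pick, and it's exactly the assumption that breaks the moment the camera crosses a border.

```python
# Pin naive EXIF dates to an assumed home time zone before storing them.
from datetime import datetime
from zoneinfo import ZoneInfo

HOME_TZ = ZoneInfo("Europe/Berlin")  # illustrative guess

def exif_to_timestamp(exif_date: str) -> int:
    """EXIF dates look like '2019:07:14 18:32:05' and carry no zone info."""
    naive = datetime.strptime(exif_date, "%Y:%m:%d %H:%M:%S")
    return int(naive.replace(tzinfo=HOME_TZ).timestamp())

print(exif_to_timestamp("2019:07:14 18:32:05"))
```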

That's also an issue when displaying the data on the timeline. If I'm in Germany, everything I did in Canada is off by 6 hours, and everything I did in Kazakhstan is off by 5 hours in the other direction. I could use the surrounding geolocation entries to infer the time zone of each entry, but that's getting a tad complex. I decided to just close my eyes and pretend the problem isn't there.

Dates

It wasn't any easier to figure out when things happened, and where they should appear on the timeline.

A photo could appear in multiple places: on the date it was taken (if it's known), on the date the file was created (if it's known), and on every date the file was modified. All of those dates are inaccurate, if they're available at all.

Currently, I look at 3 things, in order of priority:

  1. The date written in the file name. For example, selfie - 2020-03-22.jpg. This is useful for notebook scans and other documents, since the file's creation date is not the physical document's creation date.
  2. The EXIF date. This is a semi-reliable indicator of when a photo was taken. It has no time zone information, and can be way off if the camera's clock isn't set properly, but it works really well for smartphone pictures.
  3. The file modification date. I can't use the file creation date, because it's not available on many filesystems.
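A minimal sketch of that priority order, with a made-up regex and helper name; the real code handles more formats and edge cases.

```python
# Best-guess date for a file: filename date, then EXIF, then mtime.
import os
import re
from datetime import datetime, timezone

FILENAME_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def guess_date(path, exif_date=None):
    """Best guess at when a file's content was created."""
    # 1. A date embedded in the file name, e.g. "selfie - 2020-03-22.jpg"
    match = FILENAME_DATE.search(os.path.basename(path))
    if match:
        return datetime(*map(int, match.groups()), tzinfo=timezone.utc)
    # 2. The EXIF date, if the caller managed to extract one
    if exif_date:
        return exif_date
    # 3. Fall back to the file modification date
    return datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
```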

Data formats

Parsing hundreds of thousands of files from dozens of different devices led to all sorts of interesting problems.

I had to remember to include .jpeg and .mpeg files, not just .jpg and .mpg. I also had to include many other file extensions I completely forgot about: .m4v, .mod, .mts, .hevc, .3gp, .raf, .orf and so on. Thankfully, ImageMagick and ffmpeg could make thumbnails out of those without breaking a sweat.
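The thumbnailing step can be shelled out roughly like this; the flags are the usual ImageMagick and ffmpeg ones, but the actual pipeline surely tweaks them per format.

```python
# Shell out to ImageMagick for stills and to ffmpeg for videos.
import subprocess
from pathlib import Path

VIDEO_SUFFIXES = {".mpg", ".mpeg", ".m4v", ".mod", ".mts", ".hevc", ".3gp"}

def make_thumbnail(source: Path, thumb: Path, size: int = 300):
    if source.suffix.lower() in VIDEO_SUFFIXES:
        # Grab one frame a second in, scaled down to `size` pixels wide
        cmd = ["ffmpeg", "-y", "-ss", "1", "-i", str(source),
               "-frames:v", "1", "-vf", f"scale={size}:-1", str(thumb)]
    else:
        # "[0]" picks the first frame/page of multi-frame images
        cmd = ["convert", f"{source}[0]", "-resize", f"{size}x{size}", str(thumb)]
    subprocess.run(cmd, check=True)
```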

Some camera manufacturers ignore the standard EXIF date format and use their own, so I keep updating the EXIF date parser to accommodate them. I have found 3 different date formats so far.
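The accommodating parser amounts to trying each known format in turn. The non-standard formats below are examples of the kind of thing cameras produce, not the exact ones I've run into.

```python
# Try each known EXIF date format until one parses.
from datetime import datetime

EXIF_DATE_FORMATS = [
    "%Y:%m:%d %H:%M:%S",   # the standard EXIF format
    "%Y-%m-%d %H:%M:%S",   # dashes instead of colons
    "%Y/%m/%d %H:%M:%S",   # slashes, seen on some cameras
]

def parse_exif_date(value: str) -> datetime:
    for fmt in EXIF_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognised EXIF date: {value!r}")
```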

Large files

Archives can be very large, and sometimes multipart. When playing with large files, you always encounter weird new issues.

I had to configure multiple application layers to allow large file uploads, and long request/response times, so that I could upload 5+ GB Google Takeout archives. I also had to clear space on my laptop just to test Google Photos backup imports. Even then, I still run into issues when uploading larger archives. The solution might be to implement multipart uploads, but that sounds like a pain in the lower back.
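For reference, these are the kinds of knobs involved on the Django side; the reverse proxy in front needs matching limits too (nginx's client_max_body_size and proxy_read_timeout, for instance). The values are illustrative, not my actual configuration.

```python
# settings.py - illustrative values, not an exact recipe
DATA_UPLOAD_MAX_MEMORY_SIZE = 100 * 1024 ** 2  # cap on non-file request data
FILE_UPLOAD_MAX_MEMORY_SIZE = 25 * 1024 ** 2   # bigger uploads are streamed to disk
FILE_UPLOAD_TEMP_DIR = "/data/tmp/uploads"     # a partition with room for 5+ GB archives
```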

Importing large amounts of data through the browser is also troublesome. If you paste 5 MB of JSON in a text field, your browser will freeze, then crash.

When processing large files, you have to buffer the data. If you load it all at once, you quickly run out of memory. You also have to be frugal with CPU usage, because everything runs on old hardware alongside other projects. I pulled many interesting tricks to speed up the backups. It's pretty fun to tackle those little challenges along the way.
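The buffering idea in miniature: process the file in fixed-size chunks (hashing it, in this made-up example) so memory use stays flat no matter how big the archive gets.

```python
# Hash a multi-gigabyte file in fixed-size chunks to keep memory use flat.
import hashlib

def file_digest(path, chunk_size=1024 * 1024):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```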