
S3 Caching Proxy in Pure Python

Posted: Sun Jul 30, 2023 5:00 pm
by harlanji
The past year of development has been interesting because I've been using a cheap Windows laptop in S Mode, meaning I can only run binaries available in the App Store: no NodeJS, no VMs, no Linux-in-Windows, etc. While I could turn off S Mode at any time and run any binary, I prefer to leave it in this state. It's a good challenge.

A common pattern for S3-backed web apps is to cache requests to S3 and proxy requests to the app from an edge server, often Nginx or HAProxy, using a configuration like this. But as in a few cases in the past, I need to improvise and build the same thing from scratch, since I can't just run Nginx locally.

Having a private S3 bucket powered by MinIO has made me want to bake bucket support into all the apps I've built that write to their local data directory. Doing so is pretty easy: just update the read and write code to hit the proxy and warm it up. I could write a little library for use by all my apps.

In Hogumathi I built a little proxy that caches media files to a local directory, to save bandwidth on repeat views (scrolling the timeline) and enable offline viewing. That code isn't readily available in a Git repo, but I found a similar example for illustration. One extra step I take is storing the response headers alongside the binary response data; the Content-Type header can be critical, and we can add extra headers in the X- namespace, such as the cache key, if needed.

Code: Select all

import hashlib
import json

# ...
scheme = self.headers.get('X-Scheme', 'https')
# ...
# m is the hash object the cache filename is derived from
m.update(scheme.encode('utf-8'))
# ...
# on a cache miss: store the headers as a JSON sidecar next to the body
with open(cache_filename + '.headers', 'wt') as headers_output:
    cache_headers = dict(resp.headers)
    cache_headers['X-Cache-Path'] = self.path
    cache_headers['X-Cache-Scheme'] = scheme

    json.dump(cache_headers, headers_output, indent=2)

# ...

# on a cache hit: replay the stored headers before sending the body
with open(cache_filename + '.headers', 'rt') as cached_headers:
    cache_headers = json.load(cached_headers)
    for h, v in cache_headers.items():
        self.send_header(h, v)
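To make the elided parts concrete, here's a self-contained sketch of the cache key derivation and the .headers sidecar. Hashing scheme, host, and path is my assumption; the original may feed other fields into the hash.

Code: Select all

```python
import hashlib
import json
import os

def cache_key(scheme, host, path):
    """Derive a stable cache filename from the request identity."""
    m = hashlib.sha256()
    m.update(scheme.encode('utf-8'))
    m.update(host.encode('utf-8'))
    m.update(path.encode('utf-8'))
    return m.hexdigest()

def save_entry(cache_dir, key, body, headers):
    """Store the response body plus a .headers JSON sidecar next to it."""
    with open(os.path.join(cache_dir, key), 'wb') as f:
        f.write(body)
    with open(os.path.join(cache_dir, key + '.headers'), 'wt') as f:
        json.dump(dict(headers), f, indent=2)

def load_entry(cache_dir, key):
    """Return (body, headers) on a hit, or None on a miss."""
    try:
        with open(os.path.join(cache_dir, key), 'rb') as f:
            body = f.read()
        with open(os.path.join(cache_dir, key + '.headers'), 'rt') as f:
            headers = json.load(f)
        return body, headers
    except FileNotFoundError:
        return None
```

Hexdigest filenames sidestep path traversal and filesystem-unsafe characters in object keys, at the cost of a human-readable cache directory, which is one reason to stash X-Cache-Path in the sidecar.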
Here's a little Activity Diagram I drew:

[Image: activity diagram of the cache flow]

A couple of features can be bolted on top of this: a UI to manage the cache, e.g. to evict files to save space, and a way to warm objects before going offline.
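Neither needs much code on top of a file cache. A sketch — evicting least-recently-accessed files first is my choice, and warm() assumes the proxy's base URL:

Code: Select all

```python
import os
import urllib.request

def evict_to_size(cache_dir, max_bytes):
    """Delete least-recently-accessed entries until the cache fits max_bytes."""
    entries = []
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        st = os.stat(path)
        entries.append((st.st_atime, st.st_size, path))
    total = sum(size for _, size, _ in entries)
    for _, size, path in sorted(entries):  # oldest access time first
        if total <= max_bytes:
            break
        os.remove(path)
        total -= size

def warm(proxy_base, paths):
    """Fetch each path through the proxy so it lands in the cache."""
    for path in paths:
        with urllib.request.urlopen(proxy_base + path) as resp:
            resp.read()
```

In practice a body and its .headers sidecar should be evicted together rather than independently.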

Using PaaS services like Vercel or Lambda to serve web apps is increasingly common, and they come without a persistent disk cache... though there is often a temporary disk that can be used for things like receiving and processing uploads. The local caching proxy is a middle ground where we control persistence and may want to prune or pre-populate the cache for various use cases. HTTP is great because this type of use case is baked in; we just need to tune things properly and know the primitives.
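One of those primitives is the conditional request. A sketch of revalidating a cached entry against its stored ETag — the header names are standard HTTP, but the flow and function name are my illustration:

Code: Select all

```python
import urllib.request
import urllib.error

def revalidate(url, cached_etag):
    """Conditional GET: returns (changed, body). body is None on a 304."""
    req = urllib.request.Request(url, headers={'If-None-Match': cached_etag})
    try:
        with urllib.request.urlopen(req) as resp:
            return True, resp.read()  # 200: origin sent fresh content
    except urllib.error.HTTPError as e:
        if e.code == 304:
            return False, None  # Not Modified: serve from the local cache
        raise
```

On a 304 the proxy can serve the file straight from disk; on a 200 it rewrites the body and the .headers sidecar.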