S3 Caching Proxy in Pure Python

by harlanji » Sun Jul 30, 2023 5:00 pm

The past year of development has been interesting because I've been using a cheap Windows laptop in S Mode, meaning I can only run binaries available in the App Store: no NodeJS, no VMs, no Linux-in-Windows, etc. While I can turn off S Mode at any time and run any binary, I prefer to leave it on. It's a good challenge.

A common pattern for S3-backed web apps is to cache requests to S3 and proxy requests to the app from an edge server, often Nginx or HAProxy, using a configuration like this. But as in a few past cases, I needed to improvise and build the same thing from scratch, since I can't just run Nginx locally.

Having a private S3 bucket powered by Minio has made me want to bake bucket support into all the apps I've built that write to their local data directory. Doing so is pretty easy: just update the read and write code to hit the proxy and warm it up. I could write a little library for use by all my apps.

In Hogumathi I built a little proxy that caches media files to a local directory, saving bandwidth on repeat views (scrolling the timeline) and enabling offline viewing. That code isn't readily available in a Git repo, but I found a similar example for illustration. One extra step I take is storing the response headers alongside the binary response data; the Content-Type header can be critical, and we can add extra headers in the X- namespace, such as the cache key, if needed.

Code: Select all

    scheme = self.headers.get('X-Scheme', 'https')
    # ...
    m.update(scheme.encode('utf-8'))
    # ...
    with open(cache_filename + '.headers', 'wt') as headers_output:
        cache_headers = dict(resp.headers)
        cache_headers['X-Cache-Path'] = self.path
        cache_headers['X-Cache-Scheme'] = scheme

        json.dump(cache_headers, headers_output, indent=2)

    # ...

    with open(cache_filename + '.headers', 'rt') as cached_headers:
        cache_headers = json.load(cached_headers)
        for h, v in cache_headers.items():
            self.send_header(h, v)
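To show how those fragments fit together, here's a self-contained sketch of the whole pattern. This is my own reconstruction for illustration, not the Hogumathi code; the upstream host, cache directory, and hash choice are all assumptions:

```python
import hashlib
import json
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE_DIR = 'cache'       # assumed local cache directory
UPSTREAM = 'example.com'  # assumed S3/Minio host to proxy to


class CachingProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        scheme = self.headers.get('X-Scheme', 'https')

        # Derive a stable cache filename from scheme + path.
        m = hashlib.sha256()
        m.update(scheme.encode('utf-8'))
        m.update(self.path.encode('utf-8'))
        cache_filename = os.path.join(CACHE_DIR, m.hexdigest())

        if not os.path.exists(cache_filename):
            # Cache miss: fetch from upstream, store body and headers.
            url = f'{scheme}://{UPSTREAM}{self.path}'
            with urllib.request.urlopen(url) as resp:
                body = resp.read()
                cache_headers = dict(resp.headers)
                cache_headers['X-Cache-Path'] = self.path
                cache_headers['X-Cache-Scheme'] = scheme

            os.makedirs(CACHE_DIR, exist_ok=True)
            with open(cache_filename, 'wb') as body_output:
                body_output.write(body)
            with open(cache_filename + '.headers', 'wt') as headers_output:
                json.dump(cache_headers, headers_output, indent=2)

        # Serve from the cache (which the branch above just populated on a miss).
        with open(cache_filename + '.headers', 'rt') as cached_headers:
            cache_headers = json.load(cached_headers)
        with open(cache_filename, 'rb') as cached_body:
            body = cached_body.read()

        self.send_response(200)
        for h, v in cache_headers.items():
            self.send_header(h, v)
        self.end_headers()
        self.wfile.write(body)


# To run: HTTPServer(('localhost', 8080), CachingProxyHandler).serve_forever()
```

Every response is served from disk, even on a miss, so the cached bytes are exactly what clients saw; replaying the stored headers preserves Content-Type for offline viewing.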
Here's a little Activity Diagram I drew:

[Image: activity diagram]

A couple of features that could be bolted on top of this: a UI to manage the cache, with actions such as evicting files to save space or warming objects before going offline.

Using PaaS services like Vercel or Lambda to serve web apps is increasingly common, and they come without a persistent disk cache... though often there is a temporary disk that can be used to do things like receive and process uploads. The local caching proxy is a middle ground, where we have control over persistence and may want to prune or pre-populate it for various use cases. HTTP is great because this type of use case is baked in; we just need to tune things properly and know the primitives.
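One of those baked-in primitives is conditional revalidation: the cached file's timestamp can drive an If-Modified-Since request, so the origin only resends the body when it has changed. A sketch (the helper name and paths are hypothetical):

```python
import os
import urllib.request
from email.utils import formatdate
from urllib.error import HTTPError


def revalidate(url, cache_filename):
    """Re-fetch only if the origin has a newer copy than our cached file."""
    req = urllib.request.Request(url)
    if os.path.exists(cache_filename):
        mtime = os.path.getmtime(cache_filename)
        # HTTP-date of our cached copy, e.g. "Sun, 30 Jul 2023 17:00:00 GMT"
        req.add_header('If-Modified-Since', formatdate(mtime, usegmt=True))
    try:
        with urllib.request.urlopen(req) as resp:
            with open(cache_filename, 'wb') as f:
                f.write(resp.read())
            return 'refreshed'
    except HTTPError as e:
        if e.code == 304:
            return 'cache is fresh'  # origin copy unchanged, no body sent
        raise
```

The same mechanism works in reverse: the proxy itself can honor If-Modified-Since from clients, answering 304 straight from the cache.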
