A Python and Preact app deployed on Heroku
December 13, 2019
2 comments Web development, Django, Python, Docker, JavaScript
Heroku is great but it's sometimes painful when your app isn't just in one single language. What I have is a project where the backend is Python (Django) and the frontend is JavaScript (Preact). The folder structure looks like this:
/
  README.md
  manage.py
  requirements.txt
  my_django_app/
    settings.py
    asgi.py
    api/
      urls.py
      views.py
  frontend/
    package.json
    yarn.lock
    preact.config.js
    build/
      ...
    src/
      ...
A bunch of things are omitted for brevity, but people familiar with Django and preact-cli/create-react-app should feel at home.
The point is that the root is a Python app and the front-end is exclusively inside a sub folder.
When you do local development, you start two servers:
./manage.py runserver
  - starts http://localhost:8000

cd frontend && yarn start
  - starts http://localhost:3000
The latter is what you open in your browser. That preact app will do things like:
const response = await fetch('/api/search');
and, in preact.config.js I have this:
export default (config, env, helpers) => {
  if (config.devServer) {
    config.devServer.proxy = [
      {
        path: "/api/**",
        target: "http://localhost:8000"
      }
    ];
  }
};
...which is hopefully self-explanatory. So calls like GET http://localhost:3000/api/search actually go to http://localhost:8000/api/search.
That's when doing development. The interesting thing is going into production.
Before we get into Heroku, let's first "merge" the two systems into one, and the trick used is Whitenoise. Basically, Django's web server will be responsible not only for things like /api/search but also for static assets such as / --> frontend/build/index.html and /bundle.17ae4.js --> frontend/build/bundle.17ae4.js.
This is basically all you need in settings.py to make that happen:
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",
    ...
]
WHITENOISE_INDEX_FILE = True
STATIC_URL = "/"
STATIC_ROOT = BASE_DIR / "frontend" / "build"
However, this isn't quite enough, because the preact app uses preact-router, which uses pushState() and other code-splitting magic. So you might have a URL, that users see, like this: https://myapp.example.com/that/thing/special and there's nothing about that in any of the Django urls.py files. Nor is there any file called frontend/build/that/thing/special/index.html or anything like that.
So for URLs like that, we have to take a gamble on the Django side and basically hope that the preact-router config knows how to deal with it. To make that happen with Whitenoise, we need to write a custom middleware that looks like this:
from whitenoise.middleware import WhiteNoiseMiddleware


class CustomWhiteNoiseMiddleware(WhiteNoiseMiddleware):
    def process_request(self, request):
        if self.autorefresh:
            static_file = self.find_file(request.path_info)
        else:
            static_file = self.files.get(request.path_info)

        # These two lines are the magic.
        # If the URL didn't lead to a file (e.g. `/manifest.json`),
        # it's either an API path or a custom browser path that only
        # makes sense within preact-router. If that's the case, we just don't
        # know, but we'll give the client-side preact-router code the benefit
        # of the doubt and let it through.
        if not static_file and not request.path_info.startswith("/api"):
            static_file = self.files.get("/")

        if static_file is not None:
            return self.serve(static_file, request)
And in settings.py, this change:
MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
-   "whitenoise.middleware.WhiteNoiseMiddleware",
+   "my_django_app.middleware.CustomWhiteNoiseMiddleware",
    ...
]
Now, all traffic goes through Django: regular Django view functions, static assets, and everything else falls back to frontend/build/index.html.
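The routing decision the middleware ends up making can be modeled as a pure function. This is just a sketch for illustration: `resolve` and `static_paths` are made-up names standing in for Whitenoise's internal file lookup, not the actual Whitenoise API.

```python
# Sketch of the middleware's routing decision. `static_paths` stands in
# for Whitenoise's internal file registry; the names are illustrative.

def resolve(path, static_paths):
    """Return which static file (if any) should serve this path."""
    if path in static_paths:
        return path  # a real static asset like /bundle.17ae4.js
    if path.startswith("/api"):
        return None  # let Django's urls.py handle it
    # Unknown path: assume it's a preact-router route and serve the SPA.
    return "/"

static_paths = {"/", "/bundle.17ae4.js", "/manifest.json"}
assert resolve("/bundle.17ae4.js", static_paths) == "/bundle.17ae4.js"
assert resolve("/api/search", static_paths) is None
assert resolve("/that/thing/special", static_paths) == "/"
```

The key design choice is the last branch: any unrecognized, non-API path gets the index.html and the client-side router sorts it out.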
Heroku
Heroku tries to make everything simple for you. You basically create the app (via the CLI or the Heroku web app) and when you're ready you just do git push heroku master. However, that won't be enough because there's more to this than Python.
Unfortunately, I didn't take notes during my hair-pulling, excruciating journey of trying to add buildpacks and hacks and Procfiles and custom buildpacks. Nothing seemed to work. Perhaps the answer was somewhere in this issue: "Support running an app from a subdirectory", but I just couldn't figure it out. I still find buildpacks confusing once you're beyond Hello World. Also, I didn't want to run Node as a service; I just wanted it as part of the "build process".
Docker to the rescue
Finally I got a chance to try "Deploying with Docker" in Heroku, which is a relatively new feature. The only thing that scared me was that I now needed to write a heroku.yml file, which was confusing because all I had was a Dockerfile. We'll get back to that in a minute!
So here's how I made a Dockerfile that mixes Python and Node:
FROM node:12 as frontend

COPY . /app
WORKDIR /app
RUN cd frontend && yarn install && yarn build

FROM python:3.8-slim

WORKDIR /app

RUN groupadd --gid 10001 app && useradd -g app --uid 10001 --shell /usr/sbin/nologin app
RUN chown app:app /tmp

RUN apt-get update && \
    apt-get upgrade -y && \
    apt-get install -y --no-install-recommends \
    gcc apt-transport-https python-dev

# Gotta try moving this to poetry instead!
COPY ./requirements.txt /app/requirements.txt
RUN pip install --upgrade --no-cache-dir -r requirements.txt

COPY . /app
COPY --from=frontend /app/frontend/build /app/frontend/build

USER app

ENV PORT=8000
EXPOSE $PORT

CMD uvicorn gitbusy.asgi:application --host 0.0.0.0 --port $PORT
If you're not familiar with it, the critical trick is on the first line, where it builds the Node stage with as frontend. That gives me a stage I can then copy from into the Python image with COPY --from=frontend /app/frontend/build /app/frontend/build.
Now, at the very end, it starts a uvicorn server with all the static .js, index.html, and favicon.ico etc. available to uvicorn, which ultimately runs whitenoise.
To run and build:
docker build . -t my_app
docker run -t -i --rm --env-file .env -p 8000:8000 my_app
Now, opening http://localhost:8000/ gives you a production-grade app that mixes Python (runtime) and JavaScript (static).
Heroku + Docker
Heroku says to create a heroku.yml file, and that makes sense, but what didn't make sense was why I would add a cmd line in there when it's already in the Dockerfile. The solution is simple: omit it. Here's what my final heroku.yml file looks like:
build:
  docker:
    web: Dockerfile
Check in the heroku.yml file, git push heroku master, and voilà, it works!
To see a complete demo of all of this check out https://github.com/peterbe/gitbusy and https://gitbusy.herokuapp.com/
Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON
September 30, 2019
2 comments Redis, Nginx, Web Performance, Python, Django, PostgreSQL
Last week, I blogged about "How much faster is Redis at storing a blob of JSON compared to PostgreSQL?". Judging from a lot of comments, people misinterpreted this. (By the way, Redis is persistent). It's no surprise that Redis is faster.
However, it's a fact that I do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval, because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.
But is it entirely wrong to look at WHICH database to get the best speed?
Reviewing this corner of Song Search helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? Nginx
Nginx??
The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:
const response = await fetch('https://songsear.ch/api/song/5246889/amazon');
Internally, in the app, what it does is look this up, by ID, on the AmazonAffiliateLookup ORM model. If it isn't there in PostgreSQL, it uses the Amazon Affiliate Product Details API to look it up, and when the results come in it stores a copy in PostgreSQL so we can re-use the URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code are shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller: partly because it limits the results to the first/best match, and partly because it deletes a bunch of things that are never needed, such as ProductTypeName, Studio, TrackSequence etc. The proportion is roughly 23x. I.e. of the 3GB of JSON blobs stored in PostgreSQL, only 130MB is ever transported from the server to the users.
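That scrubbing step can be sketched roughly like this. The function name and the allow-list of kept fields are made up for illustration; only ProductTypeName, Studio and TrackSequence are field names actually mentioned above.

```python
# Hypothetical sketch of the "scrub and repackage" step. The allow-list
# and function name are invented; the dropped fields (ProductTypeName,
# Studio, TrackSequence) are examples from the post.

KEEP_FIELDS = {"ASIN", "Title", "DetailPageURL"}  # illustrative allow-list

def scrub(raw_results):
    """Keep only the first/best match and the fields React actually uses."""
    if not raw_results:
        return None
    best = raw_results[0]
    return {k: v for k, v in best.items() if k in KEEP_FIELDS}

raw = [
    {"ASIN": "B000123", "Title": "Song", "DetailPageURL": "https://example.com/p",
     "ProductTypeName": "DOWNLOADABLE_MUSIC_TRACK", "Studio": "X",
     "TrackSequence": 7},
    {"ASIN": "B000456", "Title": "Other"},
]
assert scrub(raw) == {
    "ASIN": "B000123", "Title": "Song", "DetailPageURL": "https://example.com/p"
}
```

Dropping everything the frontend never reads is what turns 3GB of stored blobs into ~130MB of transported data.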
Again, Nginx?
Nginx has a built-in reverse HTTP proxy cache which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle on how much RAM it's eating up. Well, if the total possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable letting Nginx cache in RAM.
Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:
▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes

Summary:
  Total:        0.9882 secs
  Slowest:      0.0279 secs
  Fastest:      0.0001 secs
  Average:      0.0010 secs
  Requests/sec: 10119.8265

Response time histogram:
  0.000 [1]    |
  0.003 [9752] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  0.006 [108]  |
  0.008 [70]   |
  0.011 [32]   |
  0.014 [8]    |
  0.017 [12]   |
  0.020 [11]   |
  0.022 [1]    |
  0.025 [4]    |
  0.028 [1]    |

Latency distribution:
  10% in 0.0003 secs
  25% in 0.0006 secs
  50% in 0.0008 secs
  75% in 0.0010 secs
  90% in 0.0013 secs
  95% in 0.0016 secs
  99% in 0.0068 secs

Details (average, fastest, slowest):
  DNS+dialup: 0.0000 secs, 0.0001 secs, 0.0279 secs
  DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0026 secs
  req write:  0.0000 secs, 0.0000 secs, 0.0011 secs
  resp wait:  0.0008 secs, 0.0001 secs, 0.0206 secs
  resp read:  0.0001 secs, 0.0000 secs, 0.0013 secs

Status code distribution:
  [200] 10000 responses
10,000 requests across 10 clients at roughly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization, etc. The cache TTL is controlled by simply setting a Cache-Control HTTP header with something like max-age=86400.
Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long Cache-Control, Nginx can hold on to this in RAM for days, or until the whole server is restarted (which is rare).
Conclusion
If the total amount of data that can and will be cached is controlled, putting it in an HTTP reverse proxy cache is probably an order of magnitude faster than agonizing over which database to use.
SongSearch autocomplete rate now 2+ per second
July 11, 2019
0 comments Django, Python, Nginx, Redis
By analyzing my Nginx logs, I've concluded that SongSearch's autocomplete JSON API now gets about 2.2 requests per second. I.e. these are XHR requests to /api/search/autocomplete?q=....
Roughly 1.8 requests per second go back to the Django/Elasticsearch backend. That's a hit ratio of 16%. These Django/Elasticsearch requests take roughly 200ms on average. I suspect about 150-180ms of that time is spent querying Elasticsearch, the rest being Python request/response and JSON "paperwork".
Caching strategy
Caching is hard because the queries are so vastly different over time. Had I put a Redis cache decorator on the autocomplete Django view function I'd quickly bloat Redis memory and cause lots of evictions.
What I used to do was something like this:
def search_autocomplete(request):
    q = request.GET.get('q')

    cache_key = None
    if len(q) < 10:
        cache_key = 'autocomplete:' + q
        results = cache.get(cache_key)
        if results is not None:
            return http.JsonResponse(results)

    results = _do_elasticsearch_query(q)
    if cache_key:
        cache.set(cache_key, results, 60 * 60)

    return http.JsonResponse(results)
However, after some simple benchmarking it was clear that using Nginx's uwsgi_cache, it was much faster to let the cacheable queries terminate already at Nginx. So I changed the code to something like this:
def search_autocomplete(request):
    q = request.GET.get('q')

    results = _do_elasticsearch_query(q)
    response = http.JsonResponse(results)

    if len(q) < 10:
        patch_cache_control(response, public=True, max_age=60 * 60)

    return response
The only annoying thing about Nginx caching is that purging is hard unless you go for Nginx Plus (or whatever their enterprise version is called). But more annoying, to me, is the fact that I can't really see what this means for my server. When I was caching with Redis I could just use redis-cli and...
> INFO
...
# Memory
used_memory:123904288
used_memory_human:118.16M
...
Nginx Amplify
My current best tool for keeping an eye on Nginx is Nginx Amplify. It gives me some basic insights about the state of things. Here are some recent screenshots:
Thoughts and conclusion
Caching is hard. But it's also fun because it ties directly into performance work.
In my business logic, I chose that autocomplete queries that are between 1 and 9 characters are cacheable. And I picked a TTL of 60 minutes. At this point, I'm not sure exactly why I chose that logic, but I remember doing some back-of-envelope calculations about what the hit ratio would be and roughly what that would mean in bytes in RAM. I definitely remember picking 60 minutes because I was nervous about bloating Nginx's memory usage. But as of today, I'm switching that up to 24 hours and let's see what that does to my current 16% Nginx cache hit ratio. At the moment, /var/cache/nginx-cache/ is only 34MB, which isn't much.
Another crux with using uwsgi_cache (or proxy_cache) is that you can't control the cache key very well. When it was all in Python, I was able to decide on the cache key myself. A plausible implementation is cache_key = q.lower().strip(), for example. That means you can protect your Elasticsearch backend from having to handle both {"q": "A"} and {"q": "a"}. Who knows, perhaps there is a way to hack this in Nginx without compiling in some Lua engine.
The ideal would be some user-friendly diagnostics tool that I can point at Nginx, that says how much my uwsgi_cache is hurting or saving me. Autocomplete is just one of many things going on on this single DigitalOcean server. There's also a big PostgreSQL server, a node-express cluster, a bunch of uwsgi workers, Redis, lots of cron job scripts, and of course a big honking Elasticsearch 6.
UPDATE (July 12 2019)
Currently, and as mentioned above, I only set Cache-Control headers (which means Nginx snaps it up) for queries that are at most 9 characters long. I wanted to understand what ratio of all queries are longer than 9 characters, so I wrote a report and its output is this:
POINT: 7
Sum show 75646   32.2%
Sum rest 159321  67.8%

POINT: 8
Sum show 83702   35.6%
Sum rest 151265  64.4%

POINT: 9
Sum show 90870   38.7%
Sum rest 144097  61.3%

POINT: 10
Sum show 98384   41.9%
Sum rest 136583  58.1%

POINT: 11
Sum show 106093  45.2%
Sum rest 128874  54.8%

POINT: 12
Sum show 113905  48.5%
Sum rest 121062  51.5%
It means that (independent of time expiry) 38.7% of queries are 9 characters or less.
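The report boils down to a cumulative count over query lengths. A simplified sketch (the sample lengths below are made up; the real report read Nginx logs):

```python
from collections import Counter

# Simplified sketch of the report: given the lengths of all observed
# queries, what share of queries is at most `point` characters long?

def share_at_most(lengths, point):
    counts = Counter(lengths)
    shown = sum(n for length, n in counts.items() if length <= point)
    return shown / len(lengths)

lengths = [3, 5, 9, 10, 12, 15, 8, 9, 2, 30]
assert share_at_most(lengths, 9) == 0.6  # 6 of these 10 queries are <= 9 chars
```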
Build an XML sitemap of XML sitemaps
June 1, 2019
0 comments Django, Python
Suppose that you have so many thousands of pages that you can't just create a single /sitemap.xml file that has all the URLs (aka <loc>) listed. Then you need to make a /sitemaps.xml that points to the other sitemap files. And if you're in the thousands, you'll need to gzip these files.
This blog post demonstrates how Song Search generates a sitemap file that points to 63 sitemap-{M}-{N}.xml.gz files which span about 1,000,000 URLs. The context here is Python, and the getting of the data is from Django. Python is pretty key here, but if you have something other than Django, you can squint and mentally replace that with your own data mapper.
Generate the sitemap .xml.gz file(s)
Here's the core of the work: a generator function that takes a Django QuerySet instance (that is ordered and filtered!) and then starts generating etree trees and dumps them to disk with gzip.
import gzip

from lxml import etree

outfile = "sitemap-{start}-{end}.xml"
batchsize = 40_000


def generate(self, qs, base_url, outfile, batchsize):
    # Use `.values` to make the query much faster
    qs = qs.values("name", "id", "artist_id", "language")

    def start():
        return etree.Element(
            "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        )

    def close(root, filename):
        with gzip.open(filename, "wb") as f:
            f.write(b'<?xml version="1.0" encoding="utf-8"?>\n')
            f.write(etree.tostring(root, pretty_print=True))

    root = filename = None
    count = 0
    for song in qs.iterator():
        if not count % batchsize:
            if filename:  # not the very first loop
                close(root, filename)
                yield filename
            filename = outfile.format(start=count, end=count + batchsize)
            root = start()
        loc = "{}{}".format(base_url, make_song_url(song))
        etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
        count += 1

    close(root, filename)
    yield filename
The most important lines in terms of lxml.etree and sitemaps are:
root = etree.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
...
etree.SubElement(etree.SubElement(root, "url"), "loc").text = loc
Another important thing is the note about using .values(). If you don't do that, Django will create a model instance for every single row the iterator returns. That's expensive. See this blog post.
Another important thing is to use a Django ORM iterator as that's much more efficient than messing around with limits and offsets.
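The batch-boundary trick in the generator above (`if not count % batchsize`) can be isolated into a tiny sketch. The function below is illustrative, not from the real code; it only shows how the offset-based filenames fall out of the loop:

```python
# Sketch of the batch-boundary logic: a new "file" starts every
# `batchsize` items and is named after its start/end offsets.

def batch_names(n_items, batchsize, template="sitemap-{start}-{end}.xml.gz"):
    names = []
    for count in range(n_items):
        if not count % batchsize:
            names.append(template.format(start=count, end=count + batchsize))
    return names

assert batch_names(100_000, 40_000) == [
    "sitemap-0-40000.xml.gz",
    "sitemap-40000-80000.xml.gz",
    "sitemap-80000-120000.xml.gz",
]
```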
Generate the map of sitemaps
The map of maps doesn't need to be gzipped since it's going to be tiny.
def generate_map_of_maps(base_url, outfile):
    root = etree.Element(
        "sitemapindex", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )

    with open(outfile, "wb") as f:
        f.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
        files_created = sorted(glob("sitemap-*.xml.gz"))
        for file_created in files_created:
            sitemap = etree.SubElement(root, "sitemap")
            uri = "{}/{}".format(base_url, os.path.basename(file_created))
            etree.SubElement(sitemap, "loc").text = uri
            lastmod = datetime.datetime.fromtimestamp(
                os.stat(file_created).st_mtime
            ).strftime("%Y-%m-%d")
            etree.SubElement(sitemap, "lastmod").text = lastmod
        f.write(etree.tostring(root, pretty_print=True))
And that sums it up. On my laptop, it takes about 60 seconds to generate 39 of these files (e.g. sitemap-1560000-1600000.xml.gz) and that's good enough.
Bonus and Thoughts
The bad news is that this is about as good as it gets in terms of performance. The good news is that there are no low-hanging-fruit fixes. I know, because I tried. I experimented with not using pretty_print=True, and with not writing with gzip.open and instead gzipping the files later on. Nothing made any significant difference. The lxml.etree part of this is, in terms of performance, marginal in comparison to the cost of actually getting the data out of the database and later writing to disk. I also experimented with generating the gzip content with zopfli and it didn't make much of a difference.
I originally wrote this code years ago, and when I did, I think I knew more about sitemaps. In my implementation I use a batch size of 40,000, so each file is called something like sitemap-40000-80000.xml.gz and weighs about 800KB. Not sure why I chose 40,000, but it's probably not important.
Generate a random IP address in Python
June 1, 2019
0 comments Django, Python
I have a commenting system where people can type in a comment and optionally their name and email if they like.
In production, where things are real, the IP addresses that get collected are all interestingly different. But when testing this manually on my laptop, since the server is running at http://localhost:8000, request.META.get('REMOTE_ADDR') always becomes 127.0.0.1. Boring! So I fake it. Like this:
import random
from ipaddress import IPv4Address


def _random_ip_address(seed):
    random.seed(seed)
    return str(IPv4Address(random.getrandbits(32)))


...

# Here's the code deep inside the POST handler, just before storing
# the form submission in the database.

if settings.DEBUG and metadata.get("REMOTE_ADDR") == "127.0.0.1":
    # Make up a random one!
    metadata["REMOTE_ADDR"] = _random_ip_address(
        str(form.cleaned_data["name"]) + str(form.cleaned_data["email"])
    )
It's pretty rough but it works and makes me happy.
Best way to count distinct indexed things in PostgreSQL
March 21, 2019
3 comments Django, PostgreSQL
tl;dr; SELECT COUNT(*) FROM (SELECT DISTINCT my_not_unique_indexed_column FROM my_table) t;
I have a table that looks like this:
songsearch=# \d main_songtexthash
                 Table "public.main_songtexthash"
  Column   |           Type           | Collation | Nullable |
-----------+--------------------------+-----------+----------+
 id        | integer                  |           | not null |
 text_hash | character varying(32)    |           | not null |
 created   | timestamp with time zone |           | not null |
 modified  | timestamp with time zone |           | not null |
 song_id   | integer                  |           | not null |
Indexes:
    "main_songtexthash_pkey" PRIMARY KEY, btree (id)
    "main_songtexthash_song_id_key" UNIQUE CONSTRAINT, btree (song_id)
    "main_songtexthash_text_hash_c2771f1f" btree (text_hash)
    "main_songtexthash_text_hash_c2771f1f_like" btree (text_hash varchar_pattern_ops)
Foreign-key constraints:
    ...snip...
And the data looks something like this:
songsearch=# select text_hash, song_id from main_songtexthash limit 10;
            text_hash             | song_id
----------------------------------+---------
 6f98e1945e64353bead9d6ab47a7f176 | 2565031
 0c6662363aa4a340fea5efa24c98db76 |  486091
 a25af539b183cbc338409c7acecc6828 |     212
 5aaf561b38c251e7d863aae61fe1363f | 2141077
 6a221df60f7cbb8a4e604f87c9e3aec0 |  245186
 d2a0b5b3b33cdf5e03a75cfbf4963a6f | 1453382
 95c395dd78679120269518b19187ca80 |  981402
 8ab19b32b3be2d592aa69e4417b732cd |  616848
 8ab19b32b3be2d592aa69e4417b732cd |  243393
 01568f1f57aeb7a97e2544978fc93b4c |     333
(10 rows)
If you look carefully, you'll notice that every song_id has a different text_hash except two of them: song IDs 616848 and 243393 both have the same text_hash of value 8ab19b32b3be2d592aa69e4417b732cd.
Also, if you imagine this table only has 10 rows, you could quickly and easily conclude that there are 10 distinct song_id values but 9 distinct text_hash values. However, how do you do this counting if the tables are large?
The Wrong Way
songsearch=# select count(distinct text_hash) from main_songtexthash;
  count
---------
 1825983
(1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(distinct text_hash) from main_songtexthash;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=44942.09..44942.10 rows=1 width=8) (actual time=40029.225..40029.226 rows=1 loops=1)
   ->  Seq Scan on main_songtexthash  (cost=0.00..40233.87 rows=1883287 width=33) (actual time=0.029..193.653 rows=1879521 loops=1)
 Planning Time: 0.059 ms
 Execution Time: 40029.250 ms
(4 rows)
Oh noes! A Seq Scan! Run!
The Right Way
Better explained in this blog post but basically, cutting to the chase, here's how you count on an indexed field:
songsearch=# select count(*) from (select distinct text_hash from main_songtexthash) t;
  count
---------
 1825983
(1 row)
And the explanation and cost analysis is:
songsearch=# explain analyze select count(*) from (select distinct text_hash from main_songtexthash) t;
                                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=193871.20..193871.21 rows=1 width=8) (actual time=4894.555..4894.556 rows=1 loops=1)
   ->  Unique  (cost=0.55..172861.54 rows=1680773 width=33) (actual time=0.075..4704.741 rows=1825983 loops=1)
         ->  Index Only Scan using main_songtexthash_text_hash_c2771f1f on main_songtexthash  (cost=0.55..168153.32 rows=1883287 width=33) (actual time=0.074..4132.822 rows=1879521 loops=1)
               Heap Fetches: 1879521
 Planning Time: 0.082 ms
 Execution Time: 4894.581 ms
(6 rows)
Same exact result but ~5s instead of ~40s. I'll take that, thank you very much.
The Django Way
As a bonus: Django is smart. Here's how they do it:
>>> SongTextHash.objects.values('text_hash').distinct().count()
1825983
And, the SQL it generates to make that count looks very familiar:
SELECT COUNT(*) FROM (SELECT DISTINCT "main_songtexthash"."text_hash" AS Col1 FROM "main_songtexthash") subquery
Conclusion
- Avoid "sequential scans" like the plague if you care about performance (or just about not killing your resources).
- Trust in Django.