Databases

Every software developer should know about databases: what kinds there are, which ones want to help you and which ones want to bring you to your knees and torture you. If you're starting a new project, you need to know which database to use and which one you'll be most productive with.

If you google right now what the best database is, you will get plenty of wishy-washy, worthless responses which don't mean anything.

  • Best database is the one which gets the job done
  • Best database depends on the situation
  • Every database is good in its own way for the use case it was intended for
  • There's no silver bullet

All of these worthless answers are actually saying: "I don't know jack about databases, but I want to look smart on the internet, so I'll write something worthless that doesn't say anything but sounds good to the simple ones."

In this post I'll be advocating for the two databases that are, in my opinion, the least painful ones I know:
  • Postgres
  • Sqlite

I'll also talk in general about the other database types out there, luring poor unsuspecting developer souls into maintenance nightmares and heartache.

Postgres is a relational database, just like Sqlite. Postgres is a server/client database: there's usually one beefy server (it depends on the deployment, but I'm simplifying) and many things connect to it and make queries. Postgres is a huge tool chest, and you'll be able to implement just about any data abstraction you want on top of it.

Postgres vs key value stores

If you've ever used a key value store, you know that usually there is no guarantee that your data fully commits, that nothing gets corrupted, and so on. You bang keys and values against it and hope everything turns out alright.

Unfortunately, there's a new breed, a new generation, which just wants to start with a key/value store. They say "our data is huge", "we cannot fit it into one server". So they lull themselves into thinking that starting with a key value store, with no transactions or sanity, is a good choice because of future flexibility/scalability. I was at a conference talk where a manager gave a brilliant answer to the question "But will it scale?". His answer was simple and ingenious: STFU. When we need to scale, then we'll worry about scaling. That is literally what he said. Because right now, implementing stuff in a relational database is much simpler and causes no havoc, and no need to maintain clusters of computers running a broken key value store. Define tables, define transactions, either everything commits or everything fails, boom, you're done. The productivity gain from using one database and having everything in one place is so massive that giving it up for the future "scalability" of a complex key value store like mongo is just not worth it.
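
To make that concrete, here's a minimal sketch of what "either all commit or all fail" looks like in plain SQL (the accounts table and the amounts are made up for illustration):

    -- hypothetical accounts table: id, balance
    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE id = 2;
    -- any error before this point means ROLLBACK and nothing happened at all
    COMMIT;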

I'll give an example from one company I worked at, with an imbecile CTO, and how using a key value store from the start hurt us all (actually, that's just one bad part of that project; it was a payment system, and after that project I started hating microservices).

So, it was a java shop, the people were of low quality and low understanding, and there was a trend of chasing every new technology out there. And we had to make a mobile payment system. Well, just a tiny itsy bitsy part of it: integrations with the actual banks, mobile phones and so on. The CTO's first batshit insanity was "we have to use many microservices BECUZ SCALE"; the other abomination was using etcd as a database. Just about every decent financial institution uses relational databases for such things, because they take their data seriously. Here, we started with a key value store. Of course, it had to be highly available, so there were many instances.

But we were so late with this project (I quit before it was "finished") because of the microservices, the hard deployments, the lack of strong type safety (the groovy abomination was used to code this), a crappy database which only an insane person would trust, and the crapstain linkerd abomination which had to connect all our crapstain tiny microservices. The end result was that we were so late to the market: the project took about two years (as far as I'm concerned, I've made more complex CRUD apps than that crap) and competitors got into the market with already working mobile payments. Put that in your pipe and smoke it: a crapstain, complex microservices JVM app that only does these things:

  • Gets an http request and forwards it somewhere else (where the backend actually uses a relational database to store data, LOL) and returns a response
  • Sometimes stores some state in etcd
  • Issues certificates for mobile clients

This crap took two years. And competitors got into the market with a working solution. The entire deployment, last I heard, needed about 10 beefy cluster machines (because the JVM rocks, right?) for this crap excuse of computing. Battlefield 3 works with 8GB of RAM, does insanely complex computations, has insanely complex state, and it runs on my laptop, yet this pile of dog crap needed more than 10 cluster machines around it. Someone was seriously screwed in the head when architecting this (and I'm glad I won't need to deal with that someone again).

Now, what it could have easily been:

  • Maybe three highly available postgres instances with failover
  • One monolithic app (I would write it in OCaml, not groovy, thus getting rid of the insane memory requirements and also gaining performance)
  • which does the exact same thing, could be scaled to 100 or so instances on multiple machines, and would let postgres take care of the locks/concurrency issues

This mobile payment app will probably never reach even a thousand requests per second (it's a local app for a tiny country), so putting in any more effort than decently setting up postgres is a waste. And even though using a crappy key value store like etcd was obviously wrong and insane, that alone might not have been the fatal nail in the coffin of this project, but it was certainly a contributor.

Crap key value store fanbois say about relational databases:

What about the performance when you draw from multiple tables and do multiple joins, etc.?

Look, your crap app needs to do joins by hand, does it not? Or maybe you've architected it so that everything is stored as one record and you don't need joins? Join performance is alright, and if you use indexes postgres optimizes joins very well so it can find the records to join fast; on the other hand, you don't have joins at all in most key value stores, fool! And when a business person asks you to compute some query, you will not write one sql statement and be done, knowing you have a consistent/atomic view of your data; you'll be writing a distributed map reduce function in erlang (http://howfuckedismydatabase.com/nosql/), and you'll be tempted to say "we don't have that data yet" when it could be one sql query away.
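
To illustrate (the clients and payments tables here are hypothetical), the one-statement answer to a business question looks like this, and an index on the join column is what keeps it fast:

    -- hypothetical schema
    CREATE TABLE clients  (id bigserial PRIMARY KEY, name text NOT NULL);
    CREATE TABLE payments (id bigserial PRIMARY KEY,
                           client_id bigint NOT NULL REFERENCES clients (id),
                           amount    numeric NOT NULL,
                           paid_at   date NOT NULL);
    -- index on the join/filter column keeps the join fast
    CREATE INDEX payments_client_id_idx ON payments (client_id);

    -- "how much did each client pay us last month?" in one statement,
    -- over one consistent, atomic view of the data
    SELECT c.name, sum(p.amount) AS total
      FROM clients c
      JOIN payments p ON p.client_id = c.id
     WHERE p.paid_at >= date '2018-01-01'
       AND p.paid_at <  date '2018-02-01'
     GROUP BY c.name
     ORDER BY total DESC;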

And you know what else is funny? Mongo added a left join capability as of 3.2 (2015) https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#pipe._S_lookup. Who needs joins, right?

We don't have schema, we don't have errors, we are free to change stuff dynamically!

Wow, nice, I can also change my schema by writing a single sql statement which alters a table or two. And even so, postgres has the jsonb datatype, and supports indexes on it, so I wouldn't care either way. But I want to go further and say why this is a bad argument anyway. In a relational database like postgres you can guard your data with constraints; you can ensure things like every client having a valid email and a valid name. You can ensure a person's mobile phone number and email are valid at the point of insertion. You can make sure that every record is unique in some way or another. The sky is the limit for how rock solid you want to make your data. So, if I query rock solid data, which doesn't allow chipmunks in an array of horses, my confidence is much higher. And if you let anything be put into the database in one place in your code, then in another place, inevitably you'll run into a situation where the data returned is not what your code expects. Sure, that can happen with a relational database too, but it's much less likely when the stuff in a single table always has the same shape.
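
Here's a minimal sketch of what I mean (a hypothetical clients table): constraints guarding the data, a one-statement schema change, and a jsonb column for the parts you genuinely want schemaless:

    -- every client must have a name and a unique, plausible-looking email
    CREATE TABLE clients (
        id     bigserial PRIMARY KEY,
        name   text  NOT NULL CHECK (length(name) > 0),
        email  text  NOT NULL UNIQUE CHECK (email LIKE '%_@_%'),
        extras jsonb NOT NULL DEFAULT '{}'   -- free-form bits, still indexable
    );
    CREATE INDEX clients_extras_idx ON clients USING gin (extras);

    -- "changing stuff dynamically" in a relational database: one statement
    ALTER TABLE clients ADD COLUMN phone text;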

And you know what's so funny? Mongo added schema validation to protect against invalid data https://docs.mongodb.com/manual/core/schema-validation/ just like relational databases do; talk about hypocrisy.

Performance is amazing!

At what, and at which key value store? Because there are people who migrated from mongo to postgres and report better performance: https://markandruth.co.uk/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb And again, if you do key value inserts/updates just in postgres, I got 40k inserts per second (roughly 3.5 billion inserts per day) without any heavy tuning, and that is more than enough for a starting business. And my question is, how much performance do you need right now? Because the choice of which database to use usually comes at the start of your project, and once you've chosen, there's a heavy cost to migrating. And most projects/businesses do not have many clients at the very beginning.
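
For reference, the key value style insert/update I'm talking about is just an upsert in postgres; a minimal sketch with a hypothetical kv table (ON CONFLICT needs postgres 9.5 or newer):

    -- a plain key value table
    CREATE TABLE kv (
        key   text  PRIMARY KEY,
        value jsonb NOT NULL
    );

    -- insert or overwrite in one statement
    INSERT INTO kv (key, value)
    VALUES ('user:42', '{"name": "Bob"}')
    ON CONFLICT (key) DO UPDATE SET value = EXCLUDED.value;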

Now sure, there are key value stores like aerospike which can do a million transactions per second on a single instance, and I do respect that, because it squeezes out every teeny bit of perf it can and it is used by large companies that actually need it. But not at the very start, because it suffers from all the problems mentioned here as well.

So, certainly not all key value stores can beat the performance of a relational database, and even if they can, when you choose a database, is it worth giving up all the safety and correctness guarantees a relational database provides?

It is clustered, runs on multiple machines!

First of all, that is not a benefit in itself, it is a maintenance burden. Problems are hard enough to deal with on one machine; when things go distributed you get the problems of a single machine plus the problems of multiple machines working in the same environment. But sometimes you may genuinely need to go beyond one machine, into distributed computation.

Postgres vs big data/hadoop

First of all, if you're starting a new project, a startup, a business, you are not big data. You have no right to utter those dreadful words. I actually work with big data right now, and you know what I'd like? A single parallel postgres machine with 100TB of RAM and 10000 cores. Unfortunately, postgres does not utilize multiple cores very well as of this writing, so we have to use spark, which runs on the JVM, which requires a lot of memory, which requires a ton of babysitting because the JVM gets awkward past 32GB of heap, which requires repartitioning your data so that it fits on one computer, which requires babysitting all the jobs so that they don't starve each other out and the pipelines finish successfully.

If you're a startup, you don't want to get into this territory. How about making your business successful with one instance, enjoying the productivity and flexibility, and not suffering the pain? Then, once your database cannot deliver your queries in time and you've tried everything to optimize it, how about going to meet the big data beast face to face?

Postgres vs queues

I've seen so many people have a database and then also have some pile-of-crap queue component which needs to be maintained. Postgres added notifications (https://www.postgresql.org/docs/9.3/static/libpq-notify.html), so now we don't have to bring in some worthless, buggy, memory hogging queue implementation on the JVM; we can just implement a queue on top of postgres and enjoy our safe, rock solid world of transactions and consistency (here's just one implementation: https://github.com/chanks/que).
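
Here's a minimal sketch of one way to build such a queue (the jobs table is hypothetical; FOR UPDATE SKIP LOCKED needs postgres 9.5 or newer, and NOTIFY/LISTEN are the notifications mentioned above):

    CREATE TABLE jobs (
        id      bigserial PRIMARY KEY,
        payload jsonb   NOT NULL,
        done    boolean NOT NULL DEFAULT false
    );

    -- a worker claims one pending job; concurrent workers skip rows that are locked
    BEGIN;
    SELECT id, payload
      FROM jobs
     WHERE NOT done
     ORDER BY id
     LIMIT 1
     FOR UPDATE SKIP LOCKED;
    -- ... do the work, then mark the claimed id (1 here) as done ...
    UPDATE jobs SET done = true WHERE id = 1;
    COMMIT;

    -- producers can wake sleeping workers instead of making them poll
    NOTIFY job_ready;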

Postgres vs search engine (like elasticsearch)

You can have a decent search with postgres, with customized weights and so on: http://rachbelaid.com/postgres-full-text-search-is-good-enough/ . It might not have all the bells and whistles that elasticsearch does, but it also does not have the maintenance cost of running an elastic cluster, and for a startup it will certainly give you a lot of mileage until you actually need to scale and it is not just your imagination that you do.
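
A minimal sketch of what that looks like, assuming a hypothetical articles table (the weights and language config are just examples):

    -- hypothetical articles table
    CREATE TABLE articles (
        id    bigserial PRIMARY KEY,
        title text NOT NULL,
        body  text NOT NULL
    );

    -- search over title and body, with title matches weighted higher
    ALTER TABLE articles ADD COLUMN search tsvector;
    UPDATE articles SET search =
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body),  'B');
    CREATE INDEX articles_search_idx ON articles USING gin (search);

    -- rank matches, best first
    SELECT id, title, ts_rank(search, query) AS rank
      FROM articles, to_tsquery('english', 'postgres & search') AS query
     WHERE search @@ query
     ORDER BY rank DESC
     LIMIT 10;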

Postgres vs time series database (influxdb)

Just try postgres, and if it does not perform, try to optimize it better: https://news.ycombinator.com/item?id=9805742

Postgres vs graph database

First of all, maybe you don't need a graph database: I don't personally know any businesses around me that have had to use one, even ones that handle millions of clients, and most simple businesses use relational databases. But even if you do need graph-style queries, could you try a "WITH RECURSIVE" query first? https://www.postgresql.org/docs/9.5/static/queries-with.html
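
A minimal sketch of a graph-ish query in plain postgres, assuming a hypothetical employees table with a manager_id self-reference:

    -- hypothetical org chart stored relationally
    CREATE TABLE employees (
        id         bigserial PRIMARY KEY,
        name       text NOT NULL,
        manager_id bigint REFERENCES employees (id)
    );

    -- everyone who reports, directly or indirectly, to employee 1
    WITH RECURSIVE reports AS (
        SELECT id, name, manager_id
          FROM employees
         WHERE manager_id = 1
        UNION ALL
        SELECT e.id, e.name, e.manager_id
          FROM employees e
          JOIN reports r ON e.manager_id = r.id
    )
    SELECT id, name FROM reports;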

Conclusion about postgres

All in all, if you're starting a startup or a business, you know you will not have many clients at the start, and you need to choose a database:
  • NEVER EVER, EVER EVER, EVER EVER EVER EVER START WITH A KEY VALUE STORE

Chances are, postgres will satisfy your every need related to data at the beginning.

Sqlite

If I need to store data locally, data which doesn't have to be shared and isn't behind a website, I use sqlite. I get a tiny, low memory footprint postgres on my machine with the same transactional semantics I get from postgres. If you need some persistent state for a local application running on your computer, go with sqlite. Do not make a hipster homegrown file format. You'll cry in the end. Firefox uses it, chrome uses it, your android phone uses it. Chances are, if firefox is satisfied with sqlite, your homegrown app will be too.
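
If you've never tried it, this is roughly all it takes (the notes table is a made-up example; run it in the sqlite3 command line shell):

    -- open or create the database file: sqlite3 app.db
    CREATE TABLE notes (
        id      INTEGER PRIMARY KEY,
        body    TEXT NOT NULL,
        created TEXT NOT NULL DEFAULT (datetime('now'))
    );

    -- same all-or-nothing transactional semantics as postgres
    BEGIN;
    INSERT INTO notes (body) VALUES ('remember to back this up');
    COMMIT;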

Honorable hipster mention

Postgres and sqlite will satisfy 99% of your database needs. Using some other monstrous beast besides these two should be the exception, not the rule. However, I want to share one hipster database I've used which brought much joy to my heart: Prolog. Prolog is the easiest database I've ever used; the syntax is very basic, much simpler than sql, yet extremely powerful. Consider this code https://github.com/Anniepoo/prolog-examples/blob/master/familytree.pl which expresses basic relationships between people: who is a father to whom, who is a brother, etc. There are no selects/wheres/joins, just relationships. And a query is simply omitting an argument to a rule, which then gets computed. Usability wise, this is the best database I've ever used in my life. Feel free to study it, you will not be disappointed. Sadly, I have no experience using it in production, and very few people know the syntax, but I believe it's worth exploring whether postgres could be queried with prolog syntax. I'd love that. There are some projects out there http://hiro-tan.org/~ekoontz/psqlog/doc/psqlog.html but so far they are not as elegant as pure prolog.

So, that was a lot of information for you, see you later, Abner.