Paolo Fragomeni
What does the ultimate database look like? Maybe the ultimate database of the future is a many-headed hydra that will attempt to solve all problems. Maybe marketing teams are going to be at the heart of its success. Perhaps it will grow tentacles and become violent toward its creator.
Unless you're a contributor to the software, you're probably limited to understanding the value propositions and a subset of features. Maybe you're an expert. Probably not. This corner is usually dark. A database has historically been a black box. You pick one that seems like the best fit and trust it.
Maybe the database of the future should just be a library, more like BerkeleyDB. I really don't want an entire server. Someone could just add a server on top of it if they really wanted one. It would also be nice if understanding it (in its entirety) wouldn't represent a big investment.
The complexity of a system can be quantified as the number of steps that it takes to bring it to its intended state. My general rule with complexity is that evidence must be provided that something is needed, otherwise it doesn't belong.
My ideal database would start with only the absolute minimum viable means to efficiently store and retrieve arbitrary data. All additional features would be abstracted into discrete modules. This model was proven to be quite successful with Node.js. Load balancing, Map Reduce, Replication, etc. could all be modules!
Of course not. That's crazy. Someone will implement a SQL module.
No. The future is never now. But LevelDB aligns more with what I'm looking for in the "ultimate database of the future".
LevelDB is a small C++ library. It's classified as a key-value store with a Log-Structured Merge-Tree (LSM) architecture. It performs very fast range queries.
The contents of the database are stored in a set of files in the filesystem. You can learn about all of the files that get created here. I'll cover the ones that are relevant to getting a basic understanding of leveldb.
Log files (*.log) are append-only. They contain a sequence of recent updates. A copy of those updates also lives in memory. Writes get logged to this structure, and reads check it first so that recent updates are reflected. This memory structure is periodically merged into an SST during compaction. Keeping a percentage of the recently active data as well as indexes in memory makes leveldb efficient for random reads, writes and range queries. The search performance is O(log N) with a very large branching factor.
Leveldb persists data in Sorted String Tables (*.sst). SSTs are files filled with immutable, arbitrary, key-value pairs sorted on their keys. Keys and values are arbitrary blobs. Each entry's value is either a value for the key, or a deletion marker for the key. The indexes for these files are also loaded into memory to speed things up. SSTs can be used to exchange up to Terabytes of sorted data segments.
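To make the relationship between the log, the in-memory copy and the SSTs a bit more concrete, here is a toy model in JavaScript. It is only a sketch: TinyStore and its methods are invented for this example, everything lives in memory, and LevelDB itself differs in the details (real SSTs are files, organized into levels and searched through indexes).

```javascript
// Toy model of the write path: log -> in-memory copy -> sorted table.
// TinyStore and all of its methods are hypothetical, not LevelDB's API.
var TOMBSTONE = { deleted: true };

function TinyStore(limit) {
  this.limit = limit;        // flush after this many in-memory entries
  this.log = [];             // stands in for the append-only *.log file
  this.memtable = new Map(); // in-memory copy of the logged updates
  this.tables = [];          // stand-ins for immutable *.sst files, newest first
}

TinyStore.prototype.put = function (key, value) {
  this.log.push({ key: key, value: value });
  this.memtable.set(key, value);
  if (this.memtable.size >= this.limit) this.flush();
};

// A delete is just a write of a deletion marker.
TinyStore.prototype.del = function (key) {
  this.put(key, TOMBSTONE);
};

// Turn the in-memory entries into an immutable table sorted by key.
TinyStore.prototype.flush = function () {
  var memtable = this.memtable;
  var table = Array.from(memtable.keys()).sort().map(function (key) {
    return { key: key, value: memtable.get(key) };
  });
  this.tables.unshift(Object.freeze(table));
  this.memtable = new Map();
  this.log = [];
};

// Reads check memory first, then tables from newest to oldest.
TinyStore.prototype.get = function (key) {
  var value;
  if (this.memtable.has(key)) {
    value = this.memtable.get(key);
    return value === TOMBSTONE ? undefined : value;
  }
  for (var i = 0; i < this.tables.length; i++) {
    // A linear scan keeps the sketch short; sorted keys would allow
    // a binary search here.
    for (var j = 0; j < this.tables[i].length; j++) {
      if (this.tables[i][j].key === key) {
        value = this.tables[i][j].value;
        return value === TOMBSTONE ? undefined : value;
      }
    }
  }
  return undefined;
};

var db = new TinyStore(2);
db.put('b', '2');
db.put('a', '1');   // limit reached: flushed as a table sorted a, b
db.put('a', 'one'); // newer value shadows the flushed one
db.del('b');        // deletion marker; limit reached again, second flush
console.log(db.get('a'), db.get('b'));
```

The point of the sketch is that nothing is ever updated in place: a newer table simply shadows an older one, and a deletion is one more entry. Compaction, which this toy leaves out, is what eventually merges the tables and drops the shadowed entries.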
SSTs are organized into a sequence of levels and continuously compacted over time. Here is an example directory listing of a database.
total 15056
drwxr-xr-x 12 408 Jan 8 17:26 ./
drwxr-xr-x 7 238 Jan 8 17:26 ../
-rw-r--r-- 1 1903717 Jan 7 13:20 000005.sst
-rw-r--r-- 1 4347280 Jan 7 13:21 000008.sst
-rw-r--r-- 1 1366221 Jan 7 13:21 000009.sst
-rw-r--r-- 1 0 Jan 7 13:38 000014.log
-rw-r--r-- 1 16 Jan 7 13:38 CURRENT
-rw-r--r-- 1 0 Jan 7 13:20 LOCK
-rw-r--r-- 1 166 Jan 7 13:38 LOG
-rw-r--r-- 1 165 Jan 7 13:23 LOG.old
-rw-r--r-- 1 65536 Jan 7 13:38 MANIFEST-000013
"It empowers you to write your own db, from a super light but fast filesystem abstraction to a beefed up server with custom replication schemes and other application-logic right at the heart of this super fast thing.[...]" - @juliangruber
"You could implement something like lambda architecture completely on top of leveldb. It's a basic building block, you can build your own db abstraction with your own trade offs on top of it." - @raynos
"[...] Instead of storing your data structures in memory, store them in a LevelDB database and fetch when needed -- this way you can store a ton of data and not have to worry about RAM and you get to keep that data across restarts. This is how I mainly use it, LevelUP makes it super simple and fetching & processing large amounts via a readStream() is just so nice to work with. If you use the inbuilt JSON encoding then you get to pretend the data is in memory (except where serialization/deserialization may change the form of your data, like Date objects), since LevelDB is so fast and all operations are async the speed impact of storing on disk is hardly noticeable." - @rvagg
"Leveldb is the node.js of databases" - @dominictarr
Levelup is a driver for LevelDB and people have started writing lots of useful modules, plugins and tools surrounding it.
To get my hands dirty, I wrote a command line tool and REPL (with autosuggestion and autocomplete for keys) to help query and manage leveldb instances. Star it and tweet about it if you like it.
Recently, Dominic Tarr asked me why I used this image (seen at the left, but not on mobile) for my blog. The image is from William Cheselden’s Osteographia or the Anatomy of the Bones. Here is the complete illustration.
I like this illustration. It's a good software analog. The actors here can only exist in the abstract. The one in the center is really expressive. His emotional posture animated and then captured in time. I think he's holding an implement, a shovel? He looks like a laborer and he's definitely got something to say about the work he does.
Hi my name is Paolo! I used to work at MIT. Before that I worked at some banks. Now I'm co-founder and CTO at Nodejitsu.
I decided to start blogging. As an exercise, I also decided to write the software that would run the blog using Node.js and Markdown. As an introductory post, I'd like to explain how it works.
Before I get started I'd like to mention that there are a lot of great options for blogging. I'm just doing this for fun, and it's a good vehicle for sharing some thoughts on Node.js. If you are new to Node this will be cool; if you're not, skip this post.
First of all, download Node.js. Then you'll want to download the code for this blog from github. Node.js is simple. Really simple. There are more interesting reasons to use Node.js, but the fact that it is simple is one of the most compelling. Node is run from the command line, so to use it open your terminal.
Create a folder on your desktop and put a text file in it called simple.js.
In this text file add console.log('hello, world');. On the command line,
navigate to this folder and type node simple.js. The program will
output the following text hello, world.
So as you can see, a node program is just a text file with Javascript in it.
Node gives you Javascript, but also allows you to interface with other parts of
your system such as a disk or network, etc. Obviously, when a program gets more
complex and a file gets bigger, you'll need to break it apart into smaller more
manageable parts. In the Node.js world, they call these modules. Modules can
be code that you write or they can be code that someone else has written. This
blog uses some modules that have been written by other people.
The code we want to run is in server/index.js. Let's start with the first 9
lines of code. This is how we get modules into our program. When you get a
module, you use the require function to specify what module you want. Some of
these modules are a part of node, and some of them are written by other
people. When they are written by other people, they are called packages.
var http = require('http');
var url = require('url');
var fs = require('fs');
var path = require('path');
var marked = require('marked');
var hljs = require('highlight.js');
var mime = require('mime');
If you run node server/index.js, you will notice that Node complains that
there is some missing software.
module.js:340
throw err;
^
Error: Cannot find module 'marked'
at Function.Module._resolveFilename (module.js:338:15)
at Function.Module._load (module.js:280:25)
at Module.require (module.js:362:17)
at require (module.js:378:17)
at Object.<anonymous> (/Users/paolo/workroot/git/hij1nx/blog/server/index.js:7:14)
at Module._compile (module.js:449:26)
at Object.Module._extensions..js (module.js:467:10)
at Module.load (module.js:356:32)
at Function.Module._load (module.js:312:12)
at Module.runMain (module.js:492:10)
In the code, we require marked, highlight.js and mime, but they are not
modules that ship with node. So we need to somehow get these. Fortunately, node
ships with a program called npm (Node Package Manager). npm is run from
the command line and can help us install modules.
There is a file in the root of this project called package.json. It is a JSON
file that contains information about the program and what modules it needs.
{
  "name": "blog",
  "description": "a minimalist blog",
  "version": "0.0.1",
  "dependencies": {
    "marked": "*",
    "mime": "*",
    "highlight.js": "*"
  }
}
When a program needs other modules to run, they are called dependencies.
To install the dependencies, type npm install within the directory where
you downloaded the code. npm will look at the package.json file, download
the dependencies for this project and install them so you can run the
program. npm is great.
After we require some modules, we start doing some things. Let's say that we want to read a file from
the disk. Parsing and normalizing strings and ensuring that the path divider
is in the correct place (and is appropriate for your operating system) can
be very tedious. Node's path module helps deal with this.
var indexpath = path.join(__dirname, '..', 'public', 'index.html');
You'll notice __dirname is being joined in the path. This is a variable
that Node provides to every module; it holds the name of the directory that
the currently executing script resides in. After joining the path it looks like
/Users/paolo/workroot/git/hij1nx/blog/public/index.html on my machine.
Let's read a file from the disk. It's the main html file for the blog. It contains most of the layout.
var index = fs.readFileSync(indexpath, 'utf8');
The part responsible for reading the file is fs.readFileSync(path).
This synchronously reads a file from the disk. It's ok to read the file
synchronously in this case, because it's relatively small and we do it
only once. utf8 tells Node how the file's text is encoded. We then
store the value in a local variable named index.
After that we set up two arrays, one to build the content and one to build
the table of contents. We will eventually merge both into the data that is
in the index variable.
var content = [], toc = [];
Now we're going to store our blog's content in markdown format. But we're not going to present markdown to our readers obviously. So we need a good way to parse it and turn it into html. Nodejitsu's Christopher Jeffrey has written a comprehensive and feature complete module for parsing markdown. Here we set some options on it.
We pass the marked module an option responsible for highlighting code. If
you look at the data folder in this project, you will find markdown files
that contain triple back ticks enclosing code snippets. After the first set
of backticks, you will see the language specified.
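marked.setOptions({
  gfm: true,
  pedantic: false,
  sanitize: true,
  highlight: function(code, lang) {
    // code fences without a language fall back to auto-detection.
    if (!lang) return hljs.highlightAuto(code).value;
    return hljs.highlight(lang, code).value;
  }
});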
The next thing we need to do is get all the markdown files. It seemed sensible to store the files named with the date and time. Naming the file with the title of the blog can get really ugly and naming them numerically doesn't provide me enough context.
Using fs.readdirSync(path) we can get an array of everything that is
in a directory. We'll also want to parse the names of the files into
Javascript dates, so let's remove the file extension for now.
var datapath = path.join(__dirname, '..', 'data');
var filenames = fs.readdirSync(datapath);
for (var i = 0, l = filenames.length; i<l; i++) {
filenames[i] = path.basename(filenames[i], '.md');
}
Most blogs will present the most recent article first, so we can use the filename to sort the array of file names.
filenames
  .sort(function (a, b) {
    //
    // This is a comparison function that will result in
    // dates being sorted in descending order. The file names
    // are parsed into dates before they are compared.
    //
    var date1 = new Date(Date.parse(a));
    var date2 = new Date(Date.parse(b));
    if (date1 > date2) return -1;
    if (date1 < date2) return 1;
    return 0;
  })
We iterate over the array of file names and read each file from disk. After getting each file we can run the markdown renderer on it. Also, since the filename is the date, why not save it off to a local variable and use it.
I think an h1 tag is a good way to identify articles. So we convert
each h1 into a link. This allows deep linking back to the article from
the url. Might as well build a table of contents at the same time!
.forEach(function (name) {
//
// get each markdown file and convert it into html.
//
// the file name should be a parsable date.
var date = name;
//
// add the file extension back since we now want to
// read it from the disk.
//
name = path.join(__dirname, '..', 'data', name + '.md');
var data = fs.readFileSync(name, 'utf8');
//
// change the headers to links to provide deep linking.
//
var markup = marked(data).replace(/<h1>(.*?)<\/h1>/, function(a, h1) {
// turn the title into something that we can use as a link.
var id = h1.replace(/ /g, '-');
// add a link to the article to the table of contents.
toc.push('<a href="#' + id + '">' + h1 +
'</a> <span class="date">' + date + '</span>');
// return the new version of the header.
return '<h1 id="' + id + '"><a href="#' + id + '">' + h1 + '</a></h1>';
});
content.push(markup);
});
Now that we've built the content, we can merge the content into
a larger document. We do that by simply looking for a unique token
in the index variable's data.
index = index.replace('<!-- toc -->', toc.join('<br/>'));
index = index.replace('<!-- content -->', content.join('<br/><hr><br/>'));
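One detail of String.prototype.replace is worth knowing when merging content this way: with a string as the pattern it replaces only the first occurrence, and with a string as the replacement it expands special patterns such as $& (the matched text). An article that happens to contain those characters would come out altered. A minimal sketch of a way around that, assuming a made-up helper called fill and a made-up template, is to pass a function as the replacement:

```javascript
// fill() is a hypothetical helper, not part of the blog's code.
function fill(template, token, html) {
  // A function replacement is inserted as-is. A string replacement
  // would have patterns such as $& or $' expanded by replace().
  return template.replace(token, function () {
    return html;
  });
}

var template = '<nav><!-- toc --></nav><main><!-- content --></main>';
var page = fill(template, '<!-- toc -->', '<a href="#a">A</a>');
page = fill(page, '<!-- content -->', '<p>costs $& more</p>');
console.log(page);
```

For a blog whose posts never contain a dollar sign followed by one of those characters, the plain string form works just as well.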
Next we're going to set up a server. Before we do this, it helps to
understand how simple an http server is. Run this with node and then go to
http://127.0.0.1:8000 in your browser. Yes, it's that simple.
var http = require('http');
http.createServer(function (req, res) {
res.writeHead(200, {'Content-Type': 'text/plain'});
res.end('Hello World\n');
}).listen(8000, '127.0.0.1');
Ok, now let's go back to looking at the code from our program. We set up our
http server and when a connection is made, we get a request (req) and a
response (res) object. We check the url of the request to find out if it's
a request for the index file. If it is, we can just serve it and be done.
http.createServer(function (req, res) {
//
// a request without any specific files
//
if (req.url === '/' || req.url === '/index.html') {
res.statusCode = 200;
res.setHeader('Content-Type', 'text/html');
res.end(index);
return;
}
Remember that once we serve this file, the browser is going to be asking us
for all kinds of stuff, style sheets, images, javascript, etc. We need to
accommodate those requests. First we parse the url and ensure we only serve
a file from the public folder of our application. It would be a huge
liability to allow people to request any file from our file system.
//
// figure out what's in the request.
//
var rawurl = url.parse(req.url);
var pathname = decodeURI(rawurl.pathname);
var base = path.join(__dirname, '..', 'public');
var filepath = path.normalize(path.join(base, pathname));
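Joining and normalizing resolves any .. segments in the request path, so a request such as /../../etc/passwd would produce a path outside the public folder. One way to guard against that is to check the result against the base folder before touching the disk. The helper below is a sketch, not part of the blog's code; the name resolveInside and the /srv/blog/public folder are made up for the example.

```javascript
var path = require('path');

// Hypothetical helper: resolve a request path against a base folder
// and return null for anything that would end up outside of it.
function resolveInside(base, pathname) {
  var filepath = path.normalize(path.join(base, pathname));
  var relative = path.relative(base, filepath);
  var escapes = relative.split(path.sep)[0] === '..' ||
    path.isAbsolute(relative);
  return escapes ? null : filepath;
}

// An assumed location for the public folder.
var base = path.resolve('/srv/blog/public');

console.log(resolveInside(base, '/css/site.css'));
console.log(resolveInside(base, '/../../etc/passwd'));
```

In the server, a null result would be answered with a 404 instead of being passed on to the file system.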
Next, we are telling the browser what kind of file we are serving it. By
passing a file extension to the mime module, it will tell us the appropriate
mime type. If we can't find a mime type, abort the request.
//
// set the appropriate mime type if possible.
//
var mimetype = mime.lookup(path.extname(filepath).split(".")[1]);
if (!mimetype) {
  // end the response so the browser isn't left waiting.
  res.statusCode = 404;
  res.end('not found');
  return;
}
res.setHeader('Content-Type', mimetype);
Next we should check to see if the file actually exists. In order to do this,
we use the fs.stat(path, callback) method. If there is a problem, the
err object will not be null. It is a typical pattern in Node for an
asynchronous operation to have the error as the first argument.
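The error-first pattern can be tried on its own, outside of the server. In this sketch describePath is a made-up helper written in the same style, and the file name is invented and assumed not to exist.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// describePath is a hypothetical helper written in the error-first style:
// callback(err) on failure, callback(null, result) on success.
function describePath(filepath, callback) {
  fs.stat(filepath, function (err, stat) {
    if (err) return callback(err);
    callback(null, stat.isDirectory() ? 'directory' : 'file');
  });
}

// The file name below is made up and is assumed not to exist.
var missing = path.join(os.tmpdir(), 'no-such-file-for-this-example.txt');

describePath(missing, function (err, kind) {
  if (err) return console.log('stat failed with code', err.code);
  console.log(missing, 'is a', kind);
});

describePath(os.tmpdir(), function (err, kind) {
  if (err) return console.log('stat failed with code', err.code);
  console.log(os.tmpdir(), 'is a', kind);
});
```

Checking err before touching the second argument matters: when the call fails there is no stat object to inspect.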
We don't want to serve directories, so we only serve the path if it's a file. We can serve it by piping the file directly to the response object. Ok, that might sound a little hand wavy. Let's conclude by having the server listen on port 80, and talk about the hand wavy part after looking at this code.
//
// find out if the file is there and if it is serve it...
//
fs.stat(filepath, function (err, stat) {
  // any error (most often ENOENT) or a directory means there is
  // nothing to serve, so answer instead of leaving the request open.
  if (err || stat.isDirectory()) {
    res.statusCode = 404;
    res.end('not found');
    return;
  }
  res.statusCode = 200;
  fs.createReadStream(filepath).pipe(res);
});
}).listen(80);
Ok, what does it mean to pipe a file to a response object? To understand
this, we first need to understand what streams are. You can think of
a stream just like the stream of water that comes out of the tap in your
kitchen sink. Let's continue this analogy.
If you stand with your mouth under the tap, no one else can drink from it
until you are done. A streaming abstraction is like pouring cups of water
and then handing them out. Once you're done you can come back for more and
everyone gets a chance to drink.
Many IO operations in Node have streaming abstractions. The main reason for this is that we don't know how big data is going to be. Data might even be continuous and if it is, we don't want to put everything or everyone on hold while we deliver it.
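Here is a small sketch of the cups-of-water idea using Node's stream module instead of a file and an http response. The names tap, drinker and cups are made up for the example; the source hands out three chunks and the destination receives them one at a time.

```javascript
var stream = require('stream');

// A source that hands out three "cups" and then signals the end
// by pushing null.
var cups = ['cup 1\n', 'cup 2\n', 'cup 3\n'];
var tap = new stream.Readable({
  read: function () {
    this.push(cups.length ? cups.shift() : null);
  }
});

// A destination that takes one chunk at a time and calls done()
// when it is ready for the next one.
var received = [];
var drinker = new stream.Writable({
  write: function (chunk, encoding, done) {
    received.push(chunk.toString());
    done();
  }
});

drinker.on('finish', function () {
  console.log('received %d chunks:', received.length);
  console.log(received.join(''));
});

// pipe() connects the two, the same call the blog uses to send
// a file to the response object.
tap.pipe(drinker);
```

Because the destination signals when it is ready for more, a slow reader never forces the whole file into memory; it just gets its next cup a little later.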
One of Node's goals is to handle network concurrency well. Streaming abstractions and features such as the Event Loop help facilitate this. For more about this, you might want to read Node's technical overview.
This was a fun article to write and code! I hope it helped you to understand something about Node. If you like it, please star the project on github. If you find any issues with it, open an issue or send a pull request!