Paolo Fragomeni



Table of contents

The ultimate database of the future Jan 8 2013

What does this picture mean? Oct 21 2012

Hello, world. Oct 14 2012

The ultimate database of the future

What does the ultimate database look like? Maybe the ultimate database of the future is a many-headed hydra that will attempt to solve all problems. Maybe marketing teams are going to be at the heart of its success. Perhaps it will grow tentacles and become violent toward its creator.

Hydra

What does your current database do?

Unless you're a contributor to the software, you're probably limited to understanding the value propositions and a subset of features. Maybe you're an expert. Probably not. This corner is usually dark. A database has historically been a black box. You pick one that seems like the best fit and trust it.

Maybe the database of the future should just be a library, more like BerkeleyDB. I really don't want an entire server. Someone could just add a server on top of it if they really wanted one. It would also be nice if understanding it (in its entirety) didn't represent a big investment.

How many features should it have?

The complexity of a system can be quantified as the number of steps that it takes to bring it to its intended state. My general rule with complexity is that evidence must be provided that something is needed, otherwise it doesn't belong.

My ideal database would start with only the absolute minimum viable means to efficiently store and retrieve arbitrary data. All additional features would be abstracted into discrete modules. This model was proven to be quite successful with Node.js. Load balancing, Map Reduce, Replication, etc. could all be modules!

Should it be relational?

Of course not. That's crazy. Someone will implement a SQL module.

Wait. Isn't the future now because leveldb?

No. The future is never now. But LevelDB aligns more with what I'm looking for in the "ultimate database of the future".


What is Leveldb?

Leveldb is a small C++ library. It's classified as a key value store with a Log-Structured Merge-Tree (LSM) architecture. It performs very fast range queries.

How does Leveldb work?

The contents of the database are stored in a set of files in the filesystem. You can learn about all of the files that get created here. I'll cover the ones that are relevant to getting a basic understanding of leveldb.

Log files (*.log) are append-only. They contain a sequence of recent updates. A copy of the log's contents lives in memory. Writes are applied to this in-memory structure, and reads check it first so that recent updates are reflected. The in-memory structure is periodically written out as an SST during compaction. Keeping a percentage of the recently active data, as well as indexes, in memory makes leveldb efficient for random reads, writes and range queries. The search performance is O(log N) with a very large branching factor.
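To make the write path a little less abstract, here's a toy model of it in Javascript. It is only a sketch: ToyStore and its methods are names I made up, the log is plain JSON lines, and real leveldb uses a binary log format and a far more careful flush. What it does show is the order of operations: append to the log, update memory, and read from memory before touching older tables.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// Toy model only (hypothetical names): not LevelDB's real data structures.
function ToyStore(logPath) {
  this.logPath = logPath;
  this.mem = {};      // in-memory table holding the most recent writes
  this.tables = [];   // flushed, read-only tables, newest first
}

ToyStore.prototype.put = function (key, value) {
  // 1. append the update to the log so it is on disk before we continue
  fs.appendFileSync(this.logPath, JSON.stringify([key, value]) + '\n');
  // 2. apply the same update to the in-memory table
  this.mem[key] = value;
};

ToyStore.prototype.get = function (key) {
  // recent writes are checked first, then older tables from newest to oldest
  var sources = [this.mem].concat(this.tables);
  for (var i = 0; i < sources.length; i++) {
    if (Object.prototype.hasOwnProperty.call(sources[i], key)) {
      return sources[i][key];
    }
  }
  return undefined;
};

ToyStore.prototype.flush = function () {
  // keep the in-memory table as the newest read-only table and
  // start over with an empty table and an empty log
  this.tables.unshift(this.mem);
  this.mem = {};
  fs.writeFileSync(this.logPath, '');
};

var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'toystore-'));
var store = new ToyStore(path.join(dir, '000001.log'));

store.put('a', '1');
store.flush();
store.put('a', '2');

console.log(store.get('a')); // the newer value shadows the flushed one
```

The flushed tables are never modified after the fact, which is the property that makes the SST files described next so simple to reason about.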

Leveldb persists data in Sorted String Tables (*.sst). SSTs are files filled with immutable, arbitrary, key-value pairs sorted on their keys. Keys and values are arbitrary blobs. Each entry's value is either a value for the key, or a deletion marker for the key. The indexes for these files are also loaded into memory to speed things up. SSTs can be used to exchange up to Terabytes of sorted data segments.
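Because the tables are sorted and immutable, combining two of them is just a merge of two sorted lists. Here's a rough sketch of that idea. TOMBSTONE and mergeTables are hypothetical names, the tables are plain arrays rather than files, and dropping deletion markers is only valid here because the sketch assumes there is no older table left underneath.

```javascript
// Hypothetical deletion marker; leveldb's on-disk encoding is different.
var TOMBSTONE = { deleted: true };

// Each table is an array of [key, value] pairs, already sorted by key.
function mergeTables(newer, older) {
  var out = [];
  var i = 0, j = 0;

  while (i < newer.length || j < older.length) {
    var entry;

    if (j >= older.length || (i < newer.length && newer[i][0] < older[j][0])) {
      entry = newer[i++];            // key only exists in the newer table
    } else if (i >= newer.length || older[j][0] < newer[i][0]) {
      entry = older[j++];            // key only exists in the older table
    } else {
      entry = newer[i++];            // same key in both: the newer entry wins
      j++;
    }

    // a deletion marker hides the key; nothing older remains, so drop it
    if (entry[1] !== TOMBSTONE) out.push(entry);
  }

  return out;
}

var older = [['apple', '1'], ['banana', '2'], ['cherry', '3']];
var newer = [['banana', TOMBSTONE], ['cherry', '30'], ['date', '4']];

var merged = mergeTables(newer, older);
console.log(merged);
// [ [ 'apple', '1' ], [ 'cherry', '30' ], [ 'date', '4' ] ]
```

The output is itself a sorted table, so the same step can be repeated as more tables pile up.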

SSTs are organized into a sequence of levels and continuously compacted over time. Here is an example directory listing of a database.

total 15056
drwxr-xr-x  12       408 Jan  8 17:26 ./
drwxr-xr-x   7       238 Jan  8 17:26 ../
-rw-r--r--   1   1903717 Jan  7 13:20 000005.sst
-rw-r--r--   1   4347280 Jan  7 13:21 000008.sst
-rw-r--r--   1   1366221 Jan  7 13:21 000009.sst
-rw-r--r--   1         0 Jan  7 13:38 000014.log
-rw-r--r--   1        16 Jan  7 13:38 CURRENT
-rw-r--r--   1         0 Jan  7 13:20 LOCK
-rw-r--r--   1       166 Jan  7 13:38 LOG
-rw-r--r--   1       165 Jan  7 13:23 LOG.old
-rw-r--r--   1     65536 Jan  7 13:38 MANIFEST-000013

Things that people have said about leveldb.

"It empowers you to write your own db, from a super light but fast filesystem abstraction to a beefed up server with custom replication schemes and other application-logic right at the heart of this super fast thing.[...]" - @juliangruber

"You could implement something like lambda architecture completely on top of leveldb. It's a basic building block, you can build your own db abstraction with your own trade offs on top of it." - @raynos

"[...] Instead of storing your data structures in memory, store them in a LevelDB database and fetch when needed -- this way you can store a ton of data and not have to worry about RAM and you get to keep that data across restarts. This is how I mainly use it, LevelUP makes it super simple and fetching & processing large amounts via a readStream() is just so nice to work with. If you use the inbuilt JSON encoding then you get to pretend the data is in memory (except where serialization/deserialization may change the form of your data, like Date objects), since LevelDB is so fast and all operations are async the speed impact of storing on disk is hardly noticeable." - @rvagg

"Leveldb is the node.js of databases" - @dominictarr

How can I use leveldb with Node.js?

Levelup is a driver for LevelDB and people have started writing lots of useful modules, plugins and tools surrounding it.

To get my hands dirty, I wrote a command line tool and REPL (with autosuggestion and autocomplete for keys) to help query and manage leveldb instances. Star it and tweet about it if you like it.







What does this picture mean?

Recently, Dominic Tarr asked me why I used this image (seen at the left, but not on mobile) for my blog. The image is from William Cheselden’s Osteographia or the Anatomy of the Bones. Here is the complete illustration.

When it's time, it's time

I like this illustration. It's a good software analog. The actors here can only exist in the abstract. The one in the center is really expressive. His emotional posture animated and then captured in time. I think he's holding an implement, a shovel? He looks like a laborer and he's definitely got something to say about the work he does.




Hello, world.

Hi my name is Paolo! I used to work at MIT. Before that I worked at some banks. Now I'm co-founder and CTO at Nodejitsu.

I decided to start blogging. As an exercise, I also decided to write the software that would run the blog using Node.js and Markdown. As an introductory post, I'd like to explain how it works.

Before I get started I'd like to mention that there are a lot of great options for blogging. I'm just doing this for fun, and it's a good vehicle for sharing some thoughts on Node.js. If you are new to node this will be cool; if you're not, skip this post.

First of all, download Node.js. Then you'll want to download the code for this blog from github. Node.js is simple. Really simple. There are more interesting reasons to use Node.js, but the fact that it is simple is one of the most compelling. Node is run from the command line, so to use it open your terminal.

Let's write a small node program.

Create a folder on your desktop and put a text file in it called simple.js. In this text file add console.log('hello, world');. On the command line, navigate to this folder and type node simple.js. The program will output the following text: hello, world.

So as you can see, a node program is just a text file with Javascript in it. Node gives you Javascript, but also allows you to interface with other parts of your system such as a disk or network, etc. Obviously, when a program gets more complex and a file gets bigger, you'll need to break it apart into smaller more manageable parts. In the Node.js world, they call these modules. Modules can be code that you write or they can be code that someone else has written. This blog uses some modules that have been written by other people.
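If you'd like to see a module of your own in action, here's a minimal sketch. Normally you would create the second file by hand next to your program; to keep this runnable as a single file, it writes a tiny module (greet.js, a name I made up) into a temporary folder and then requires it.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// write a one-line module to a temporary folder (hypothetical file: greet.js)
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'modules-'));
var file = path.join(dir, 'greet.js');

fs.writeFileSync(file,
  "module.exports = function (name) { return 'hello, ' + name; };\n");

// whatever the file assigns to module.exports is what require returns
var greet = require(file);

console.log(greet('world')); // hello, world
```

The only contract between the two files is module.exports on one side and the return value of require on the other.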

Working with modules

The code we want to run is in server/index.js. Let's start with the first 9 lines of code. This is how we get modules into our program. When you get a module, you use the require function to specify what module you want. Some of these modules are a part of node, and some of them are written by other people. When they are written by other people, they are called packages.

var http = require('http');
var url = require('url');
var fs = require('fs');
var path = require('path');

var marked = require('marked');
var hljs = require('highlight.js');
var mime = require('mime');

If you run node server/index.js, you will notice that Node complains and that there is some missing software.

module.js:340
    throw err;
          ^
Error: Cannot find module 'marked'
    at Function.Module._resolveFilename (module.js:338:15)
    at Function.Module._load (module.js:280:25)
    at Module.require (module.js:362:17)
    at require (module.js:378:17)
    at Object.<anonymous> (/Users/paolo/workroot/git/hij1nx/blog/server/index.js:7:14)
    at Module._compile (module.js:449:26)
    at Object.Module._extensions..js (module.js:467:10)
    at Module.load (module.js:356:32)
    at Function.Module._load (module.js:312:12)
    at Module.runMain (module.js:492:10)

In the code, we require marked, highlight.js and mime, but they are not modules that ship with node. So we need to somehow get these. Fortunately, node ships with a program called npm (Node Package Manager). npm is run from the command line and can help us install modules.

There is a file in the root of this project called package.json. It is a JSON file that contains information about the program and what modules it needs.

{
    "name": "blog",
    "description": "a minimalist blog",
    "version": "0.0.1",
    "dependencies": {
        "marked": "*",
        "mime": "*",
        "highlight.js": "*"
    }
}

When a program needs other modules to run, they are called dependencies. To install the dependencies, type npm install within the directory where you downloaded the code. npm will look at the package.json file and download the dependencies listed for this project so you can run the program. npm is great.
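There's nothing magic about the manifest. Here's a small sketch that reads the same data any program could read from it. This is not how npm is implemented; it only shows the information npm starts from. The JSON is inlined (and trimmed) so the snippet runs on its own.

```javascript
// the manifest from above, inlined so that this sketch is self-contained
var text = [
  '{',
  '  "name": "blog",',
  '  "version": "0.0.1",',
  '  "dependencies": { "marked": "*", "mime": "*", "highlight.js": "*" }',
  '}'
].join('\n');

var manifest = JSON.parse(text);

var wanted = Object.keys(manifest.dependencies).map(function (name) {
  // "*" is a version range that accepts any published version
  return name + '@' + manifest.dependencies[name];
});

console.log(wanted.join(' ')); // marked@* mime@* highlight.js@*
```

Accepting any version is convenient for a toy project. For anything you depend on, a narrower range makes installs more predictable.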

After we require some modules we start doing some things.

Working with file paths

Let's say that we want to read a file from the disk. Parsing and normalizing strings and ensuring that the path divider is in the correct place (and is appropriate for your operating system) can be very tedious. Node's path module helps deal with this.

var indexpath = path.join(__dirname, '..', 'public', 'index.html');

You'll notice __dirname is being joined in the path, this is a global variable that represents the name of the directory that the currently executing script resides in. After joining the path it looks like /Users/paolo/workroot/git/hij1nx/blog/public/index.html on my machine.
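Here's a quick way to try this without the blog's code. The sketch uses path.posix.join, which behaves like path.join but always uses forward slashes, so the output is the same on every operating system. The serverDir value is a stand-in for __dirname on my machine.

```javascript
var path = require('path');

// stand-in for __dirname when server/index.js runs on my machine
var serverDir = '/Users/paolo/workroot/git/hij1nx/blog/server';

// '..' climbs out of server/ before descending into public/
var indexpath = path.posix.join(serverDir, '..', 'public', 'index.html');
console.log(indexpath);
// /Users/paolo/workroot/git/hij1nx/blog/public/index.html

// doubled or trailing separators are cleaned up as part of the join
var messy = path.posix.join(serverDir + '/', '../public//', 'index.html');
console.log(messy === indexpath); // true
```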

Building the content

Let's read a file from the disk. It's the main html file for the blog. It contains most of the layout.

var index = fs.readFileSync(indexpath, 'utf8');

The part responsible for reading the file is fs.readFileSync(path, encoding). This synchronously reads a file from the disk. It's ok to read the file synchronously in this case, because it's relatively small and we only do it once. utf8 tells Node which encoding to use when turning the file's bytes into a string. We then store the value in a local variable named index.
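To see what the encoding argument changes, here's a small sketch that writes a sample file to a temporary folder (the file and its contents are made up) and reads it back both ways.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// hypothetical sample file; \u00e9 is an "e" with an acute accent
var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'read-'));
var file = path.join(dir, 'index.html');
fs.writeFileSync(file, '<h1>h\u00e9llo</h1>');

var asText = fs.readFileSync(file, 'utf8'); // decoded into a string
var asBytes = fs.readFileSync(file);        // no encoding: a raw Buffer

console.log(typeof asText, asText.length);              // string 14
console.log(Buffer.isBuffer(asBytes), asBytes.length);  // true 15
```

The two lengths differ because the accented character is one character in the string but two bytes on disk.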

After that we set up two arrays, one to build the content and one to build the table of contents. We will eventually merge these into the data that is in the index variable.

var content = [], toc = [];

Now we're going to store our blog's content in markdown format. But we're not going to present markdown to our readers obviously. So we need a good way to parse it and turn it into html. Nodejitsu's Christopher Jeffrey has written a comprehensive and feature complete module for parsing markdown. Here we set some options on it.

We pass the marked module an option responsible for highlighting code. If you look at the data folder in this project, you will find markdown files that contain triple back ticks enclosing code snippets. After the first set of backticks, you will see the language specified.

marked.setOptions({
  gfm: true,
  pedantic: false,
  sanitize: true,
  highlight: function(code, lang) {
    return hljs.highlight(lang, code).value;
  }
});

The next thing we need to do is get all the markdown files. It seemed sensible to store the files named with the date and time. Naming the file with the title of the blog can get really ugly and naming them numerically doesn't provide me enough context.

Using fs.readdirSync(path) we can get an array of everything that is in a directory. We'll also want to parse the names of the files into Javascript dates, so let's remove the file extension for now.

var datapath = path.join(__dirname, '..', 'data');
var filenames = fs.readdirSync(datapath);

for (var i = 0, l = filenames.length; i<l; i++) {
  filenames[i] = path.basename(filenames[i], '.md');
}

Most blogs will present the most recent article first, so we can use the filename to sort the array of file names.

filenames
    .sort(function (name1, name2) {

      //
      // This is a comparison function that will result in
      // dates being sorted in descending order. The arguments
      // are file names, so parse them into dates first.
      //
      var date1 = new Date(Date.parse(name1));
      var date2 = new Date(Date.parse(name2));

      if (date1 > date2) return -1;
      if (date1 < date2) return 1;
      return 0;
    })
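If you want to try the extension stripping and the sort in isolation, here's a compact sketch. The file names are hypothetical. All the real ones need is to be parsable as dates, and I've assumed ISO dates here because Date.parse handles them consistently.

```javascript
var path = require('path');

// hypothetical file names, one per post
var filenames = ['2012-10-14.md', '2013-01-08.md', '2012-10-21.md'];

var sorted = filenames
  .map(function (name) {
    // drop the extension so that only the date is left
    return path.basename(name, '.md');
  })
  .sort(function (a, b) {
    // Date.parse returns milliseconds; a positive result puts b first,
    // which places later dates at the front of the array
    return Date.parse(b) - Date.parse(a);
  });

console.log(sorted); // [ '2013-01-08', '2012-10-21', '2012-10-14' ]
```

Subtracting timestamps is a shorter way to write the same three-way comparison.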

We iterate over the array of file names and read the file from disk. After getting each file we can run the markdown renderer on it. Also, since the filename is the date, why not save it off to a local variable and use it.

I think an h1 tag is a good way to identify articles. So we convert each h1 into a link. This allows deep linking back to the article from the url. Might as well build a table of contents at the same time!

    .forEach(function (name) {

      //
      // get each markdown file and convert it into html.
      //

      // the file name should be a parsable date.
      var date = name;

      //
      // add the file extension back since we now want to
      // read it from the disk.
      //
      name = path.join(__dirname, '..', 'data', name + '.md');
      var data = fs.readFileSync(name, 'utf8');

      //
      // change the headers to links to provide deep linking.
      //
      var markup = marked(data).replace(/<h1>(.*?)<\/h1>/, function(a, h1) {

          // turn the title into something that we can use as a link.
          var id = h1.replace(/ /g, '-');

          // add a link to the article to the table of contents.
          toc.push('<a href="#' + id + '">' + h1 +
          '</a> <span class="date">' + date + '</span>');

          // return the new version of the header. the named anchor is
          // closed right away so that it doesn't wrap the heading.
          return '<a id="' + id + '"></a><h1><a href="#' + id + '">' + h1 + '</a></h1>';
      });

      content.push(markup);
    });
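The header rewrite is easier to follow on a single string. Here's a sketch that runs the same kind of replacement on a hand-written snippet of html instead of the output of marked. linkTitle is a hypothetical helper, and it puts the id on the h1 itself, which is a small variation on the code above.

```javascript
var toc = [];

// hypothetical helper: links the first <h1> and records it in the toc
function linkTitle(html, date) {
  return html.replace(/<h1>(.*?)<\/h1>/, function (match, title) {

    // spaces become dashes so the title can be used as a fragment id
    var id = title.replace(/ /g, '-');

    toc.push('<a href="#' + id + '">' + title +
      '</a> <span class="date">' + date + '</span>');

    return '<h1 id="' + id + '"><a href="#' + id + '">' + title + '</a></h1>';
  });
}

var html = linkTitle('<h1>Hello, world.</h1>\n<p>Hi!</p>', 'Oct 14 2012');

console.log(html);
console.log(toc[0]);
```

Because the regular expression has no g flag, only the first h1 in each article is treated as its title.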

Now that we've built the content, we can merge the content into a larger document. We do that by simply looking for a unique token in the index variable's data.

index = index.replace('<!-- toc -->', toc.join('<br/>'));
index = index.replace('<!-- content -->', content.join('<br/><hr><br/>'));
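One thing to watch for with this approach: when the replacement is a string, Javascript gives sequences like $& a special meaning, even if the thing being searched for is a plain string. Here's a sketch of a slightly more defensive version. The layout and the fill helper are made up for the example.

```javascript
// hypothetical layout with the same two tokens
var layout = '<nav><!-- toc --></nav><main><!-- content --></main>';

function fill(template, token, value) {
  // a function replacement inserts its return value literally, so "$&"
  // inside an article is not treated as a replacement pattern
  return template.replace(token, function () { return value; });
}

var article = '<p>Price: $& up</p>';

var page = fill(layout, '<!-- toc -->', '<a href="#a">A</a>');
page = fill(page, '<!-- content -->', article);
console.log(page);
// <nav><a href="#a">A</a></nav><main><p>Price: $& up</p></main>

// for comparison, a string replacement expands "$&" to the matched token
var naive = layout.replace('<!-- content -->', article);
console.log(naive);
// <nav><!-- toc --></nav><main><p>Price: <!-- content --> up</p></main>
```

For a blog full of code samples, where dollar signs are common, the function form is the safer habit.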

Serving the content

Next we're going to set up a server. Before we do this, it helps to understand how simple an http server is. Run this with node and then go to http://127.0.0.1:8000 in your browser. Yes, it's that simple.

var http = require('http');
http.createServer(function (req, res) {

  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(8000, '127.0.0.1');

Ok, now let's go back to looking at the code from our program. We set up our http server and when a connection is made, we get a request (req) and a response (res) object. We check the url of the request to find out if it's a request for the index file. If it is, we can just serve it and be done.

http.createServer(function (req, res) {

  //
  // a request without any specific files
  //
  if (req.url === '/' || req.url === '/index.html') {
    res.statusCode = 200;
    res.setHeader('Content-Type', 'text/html');
    res.end(index);
    return;
  }

Remember that once we serve this file, the browser is going to be asking us for all kinds of stuff: style sheets, images, javascript, etc. We need to accommodate those requests. First we parse the url and ensure we only serve a file from the public folder of our application. It would be a huge liability to allow people to request any file from our file system.

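  //
  // figure out what's in the request.
  //
  var rawurl = url.parse(req.url);
  var pathname = decodeURI(rawurl.pathname);
  var base = path.join(__dirname, '..', 'public');
  var filepath = path.normalize(path.join(base, pathname));

  // refuse anything that resolves outside of the public folder.
  if (filepath.indexOf(base + path.sep) !== 0) {
    res.statusCode = 403;
    res.end('forbidden');
    return;
  }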

Next, we are telling the browser what kind of file we are serving it. By passing a file extension to the mime module, it will tell us the appropriate mime type. If we can't find a mime type, abort the request.

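  //
  // set the appropriate mime type if possible. path.extname returns
  // an empty string when there is no extension, which is safe to pass on.
  //
  var mimetype = mime.lookup(path.extname(filepath));

  if (!mimetype) {
    res.statusCode = 404;
    res.end('not found');
    return;
  }

  res.setHeader('Content-Type', mimetype);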

Next we should check to see if the file actually exists. In order to do this, we use the fs.stat(path, callback) method. If there is a problem, the err object will not be null. It is a typical pattern in Node for an asynchronous operation to have the error as the first argument.
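If the error-first pattern is new to you, here's a small sketch of it on its own. describePath is a hypothetical helper that wraps fs.stat and follows the same convention, and the two calls at the bottom show what the callback receives for a path that exists and for one that almost certainly doesn't.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// hypothetical helper: reports what kind of thing lives at a path
function describePath(target, callback) {
  fs.stat(target, function (err, stat) {

    // pass the error along as the first argument and stop
    if (err) return callback(err);

    // no error: the first argument is null and the result comes second
    callback(null, stat.isDirectory() ? 'directory' : 'file');
  });
}

describePath(os.tmpdir(), function (err, kind) {
  console.log(err, kind); // null 'directory'
});

var missing = path.join(os.tmpdir(), 'no-such-file-' + process.pid);

describePath(missing, function (err) {
  console.log(err.code); // ENOENT
});
```

Checking err before touching the result matters, because when there is an error the second argument is not there to use.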

We don't want to serve directories. So if it's not a directory we should serve it. We can serve it by piping the file directly to the response object. Ok, that might sound a little hand wavy. Let's conclude by having the server listen on port 80. But let's talk about the hand wavy part after looking at this code.

  //
  // find out if the file is there and if it is serve it...
  //
  fs.stat(filepath, function (err, stat) {

    //
    // anything we can't stat, and any directory, is treated as missing.
    //
    if (err || stat.isDirectory()) {
      res.statusCode = 404;
      res.end('not found');
      return;
    }

    res.statusCode = 200;
    fs.createReadStream(filepath).pipe(res);
  });

}).listen(80);

Streaming Abstractions for fun and performance

Ok, what does it mean to pipe a file to a response object? To understand this, we first need to understand what streams are. You can think of a stream just like the stream of water that comes out of the tap in your kitchen sink. Let's continue this analogy.

If you stand with your mouth under the tap, no one else can drink from it until you are done. A streaming abstraction is like pouring cups of water and then handing them out. Once you're done you can come back for more and everyone gets a chance to drink.

Many IO operations in Node have streaming abstractions. The main reason for this is that we don't know how big data is going to be. Data might even be continuous and if it is, we don't want to put everything or everyone on hold while we deliver it.
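You can watch the cups being poured. Here's a sketch that writes a 64 KiB file to a temporary folder and then reads it back as a stream, counting the pieces as they arrive. countChunks is a hypothetical helper, and the 16 KiB highWaterMark is just a value I picked to make the chunks small enough to count.

```javascript
var fs = require('fs');
var os = require('os');
var path = require('path');

// hypothetical helper: reads a file as a stream and reports how it arrived
function countChunks(file, callback) {
  var chunks = 0, bytes = 0;

  fs.createReadStream(file, { highWaterMark: 16 * 1024 })
    .on('data', function (chunk) {
      // each chunk is one "cup"; the file is never held in memory at once
      chunks++;
      bytes += chunk.length;
    })
    .on('error', callback)
    .on('end', function () {
      callback(null, chunks, bytes);
    });
}

var dir = fs.mkdtempSync(path.join(os.tmpdir(), 'stream-'));
var file = path.join(dir, 'big.txt');
fs.writeFileSync(file, Buffer.alloc(64 * 1024, 'a')); // 64 KiB of "a"

countChunks(file, function (err, chunks, bytes) {
  if (err) throw err;
  console.log(chunks + ' chunks, ' + bytes + ' bytes');
});
```

Piping does the same thing, except that each chunk is handed to the response as it arrives instead of being counted.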

One of Node's goals is to handle network concurrency well. Streaming abstractions and features such as the Event Loop help facilitate this. For more about this, you might want to read Node's technical overview.

Conclusion

This was a fun article to write and code! I hope it helped you to understand something about Node. If you like it, please star the project on github. If you find any issues with it, open an issue or send a pull request!