Fetching a File in NodeJS with Request and Parsing It with Readline

Recently, I was working on a project where I needed to fetch a file from the internet and parse it line by line. NodeJS has the readline module that can be used to split a data stream into lines and perform a function on each line, which was perfect for my needs. The problem is that most of the examples online show how to read a file from the file system and parse that. It wasn’t obvious to me how to parse a file that was actively being fetched from the internet.

After some experimentation, I figured out an elegant solution and I thought I’d share it here so other people could use the same technique.

Why would you parse a file that is still downloading?

The system that I was designing for is one that is potentially memory and hard disk constrained. It might be running in a serverless environment or it might just be a low-powered machine. Its job is to download and parse a bunch of large files. It needs to remove any comments and then update a database with the data it finds in the rest of the file.

An alternative method would be to download and save each file to the hard disk and then parse the file. This has the downside of requiring a lot of disk space and, if the server is operating in a serverless environment, it would need to be a central file storage system which may be more complicated to deal with. Also, I don’t need the files once they’ve been parsed so saving data to the disk is unnecessary.

Another alternative would be to fetch the entire file and then parse it. This has the downside of being slower (parsing doesn't begin until the file is fully downloaded) and could take a lot of memory. The files I'm working with can be pretty large, and that's before counting the memory needed to store the data parsed from the file.

My solution was to parse the data as it streams in from the internet. I’m using the request module to fetch the data and I’m connecting it to the readline module that will parse the lines and call a function each time a new line is found. This means we’re not unnecessarily saving data and we’re not waiting for the whole file to download.

But as I mentioned in the intro, connecting request to readline wasn’t immediately obvious to me.

How do I connect readline to request in NodeJS?

The answer turned out to be really simple (as the best ones often are). The important thing to know is that the input for readline should be a readable stream and that request.get creates a readable stream.

Here’s the code:

const request = require('request');
const readline = require('readline');

function parseFile(url) {
  return new Promise((resolve, reject) => {
    const lineBuffer = [];
    const rl = readline.createInterface({
      input: request.get(url).on('error', (err) => reject(err)),
    });
    rl.on('line', (line) => {
      // strip out comments or parse the line and push it to lineBuffer
    });
    rl.on('close', () => resolve(lineBuffer));
  });
}

readline and request are event-based modules, which is why you see something like .on('event name', callback) several times in this code block. But I prefer to work with promises when possible, so the function opens with a new Promise.

Next, I create an interface for readline and attach my request.get as the input. If my get fails, I reject the promise.

If readline emits a line event, that means I've downloaded a new line to be processed. When that occurs, I look for my comment delimiter at the beginning of the line. If I find it, I ignore the line. Otherwise, I process the line and add its data to my data buffer. I'm using an array for the data buffer because I still want the lines to be kept separate.
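That filtering step can be pulled out into a small helper. Here's a sketch, assuming a hypothetical '#' comment delimiter (substitute whatever delimiter your file format actually uses):

```javascript
// Hypothetical helper: '#' is an assumed comment delimiter.
function handleLine(line, lineBuffer) {
  if (line.startsWith('#')) {
    return; // comment line: ignore it
  }
  // Otherwise keep the line's data, one array entry per line.
  lineBuffer.push(line);
}
```

Inside the 'line' callback you'd just call handleLine(line, lineBuffer), which keeps the event handler itself small and makes the parsing logic easy to test on its own.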

If readline emits a close event, the entire file has been downloaded, so I have all the data and can resolve my promise with the completed data buffer.

And that’s it! A function that accepts a URL, downloads the file, ignores comments and returns the data, all with a nice Promise-based structure so you don’t end up in callback hell.
