Convert streamed buffers to a UTF-8 string


Question

I want to make an HTTP request using Node.js to load some text from a web server. Since the response can contain a lot of text (several megabytes), I want to process each text chunk separately. I can achieve this using the following code:

var req = http.request(reqOptions, function(res) {
    ...
    res.setEncoding('utf8');
    res.on('data', function(textChunk) {
        // process utf8 text chunk
    });
});

This seems to work without problems. However, I also want to support HTTP compression, so I use zlib:

var zip = zlib.createUnzip();

// NO res.setEncoding('utf8') here since we need the raw bytes for zlib
res.on('data', function(chunk) {
    // do something like checking the number of bytes downloaded
    zip.write(chunk); // give the raw bytes to zlib, see below
});

zip.on('data', function(chunk) {
    // convert chunk to utf8 text:
    var textChunk = chunk.toString('utf8');

    // process utf8 text chunk
});

This can be a problem for multi-byte characters like '\u00c4', which consists of two bytes: 0xC3 and 0x84. If the first byte falls at the end of one chunk (Buffer) and the second byte at the start of the next, chunk.toString('utf8') will produce incorrect characters at the end of the first text chunk and the beginning of the second. How can I avoid this?
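The failure described above can be reproduced without any network traffic. The following is a minimal sketch that splits the two bytes of '\u00c4' by hand; the variable names are only illustrative.

```javascript
// '\u00c4' is encoded in UTF-8 as the two bytes 0xC3 0x84.
const whole = Buffer.from('\u00c4', 'utf8');
const firstChunk = whole.slice(0, 1);  // contains only 0xC3
const secondChunk = whole.slice(1);    // contains only 0x84

// Decoding each chunk on its own cannot reassemble the character.
const naive = firstChunk.toString('utf8') + secondChunk.toString('utf8');

console.log(naive === '\u00c4');        // false
console.log(naive.includes('\ufffd'));  // true: replacement characters
```

Each half is an invalid UTF-8 sequence on its own, so toString substitutes the replacement character U+FFFD instead of the intended letter.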

Hint: I still need the buffer (more specifically the number of bytes in the buffer) to limit the number of downloaded bytes. So using res.setEncoding('utf8') like in the first example code above for non-compressed data does not suit my needs.

2/16/2015 9:33:32 AM

Accepted Answer

Single Buffer

If you have a single Buffer, you can use its toString method, which converts all or part of the binary contents to a string using a specific encoding. The encoding defaults to utf8 if you don't provide a parameter, but I've set it explicitly in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});
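To illustrate the "all or part" wording, here is a small sketch of toString with the optional start and end arguments. The sample string is an assumption chosen for illustration. Note that the offsets count bytes, not characters.

```javascript
// 'h\u00e4llo' is 5 characters but 6 bytes, because '\u00e4' takes 2 bytes.
const buf = Buffer.from('h\u00e4llo', 'utf8');

// Convert the whole buffer.
const all = buf.toString('utf8');

// Convert only bytes 0..2 (end offset 3 is exclusive): 'h' plus both bytes of '\u00e4'.
const part = buf.toString('utf8', 0, 3);

console.log(buf.length);  // 6
console.log(all);
console.log(part);
```

Because the range is given in bytes, an end offset that falls inside a multi-byte character runs into the same problem as a chunk boundary in a stream.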

Streamed Buffers

If you have streamed buffers, as in the question above, where the first byte of a multi-byte UTF-8 character may arrive in one Buffer (chunk) and the remaining bytes in the next, you should use a StringDecoder:

var StringDecoder = require('string_decoder').StringDecoder;

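var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        // returns only complete characters; trailing partial bytes are kept
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });

    res.on('end', function() {
        // flush any bytes still buffered in the decoder
        var rest = decoder.end();
        // process remaining text (usually empty)
    });
});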

This way, the bytes of an incomplete character are buffered by the StringDecoder until all the bytes required to complete it have been written to the decoder.
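The buffering behaviour can be checked in isolation. The sketch below is not the original poster's code: decodeChunks is a hypothetical helper that feeds a list of Buffers through one StringDecoder while counting the raw bytes, which is the number the question wants for limiting the download size.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: decodes a sequence of Buffers as one UTF-8 stream
// and counts the raw bytes that passed through.
function decodeChunks(chunks) {
    const decoder = new StringDecoder('utf8');
    let byteCount = 0;
    let text = '';
    for (const chunk of chunks) {
        byteCount += chunk.length;     // raw byte count stays available
        text += decoder.write(chunk);  // emits complete characters only
    }
    text += decoder.end();             // flush anything still buffered
    return { text: text, byteCount: byteCount };
}

// Split '\u00c4rger' so that the two bytes of '\u00c4' land in different chunks.
const bytes = Buffer.from('\u00c4rger', 'utf8');
const result = decodeChunks([bytes.slice(0, 1), bytes.slice(1)]);

console.log(result.text === '\u00c4rger');  // true
console.log(result.byteCount);              // 6
```

In the zlib setup from the question, the same decoder.write(chunk) call would go inside the zip.on('data', ...) handler, and the byte counting could stay in the res.on('data', ...) handler, which still receives raw Buffers.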

7/19/2016 7:20:06 PM

Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow