Grayscale Release System OOM Troubleshooting

ayou - Aug 12 - - Dev Community

In a previous article, we discussed how to implement a grayscale release system for Node.js services using a process-based approach. The code below shows the idea:

// index.js
const cp = require('child_process')
const url = require('url')
const http = require('http')

const child1 = cp.fork('child.js', [], {
  env: {PORT: 3000},
})
const child2 = cp.fork('child.js', [], {
  env: {PORT: 3001},
})

function afterChildrenReady() {
  let readyN = 0
  let _resolve

  const p = new Promise((resolve) => {
    _resolve = resolve
  })

  const onReady = (msg) => {
    if (msg === 'ready') {
      if (++readyN === 2) {
        _resolve()
      }
    }
  }

  child1.on('message', onReady)
  child2.on('message', onReady)

  return p
}

const httpServer = http.createServer(function (req, res) {
  const query = url.parse(req.url, true).query

  if (query.version === 'v1') {
    http.get('http://localhost:3000', (proxyRes) => {
      proxyRes.pipe(res)
    })
  } else {
    http.get('http://localhost:3001', (proxyRes) => {
      proxyRes.pipe(res)
    })
  }
})

afterChildrenReady().then(() => {
  httpServer.listen(8000, () => console.log('Start http server on 8000'))
})

// child.js
const http = require('http')

const httpServer = http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'})
  setTimeout(() => {
    res.end('handled by child, pid is ' + process.pid + '\n')
  }, 1000)
})

httpServer.listen(process.env.PORT, () => {
  process.send && process.send('ready')
  console.log(`Start http server on ${process.env.PORT}`)
})

In summary, running index.js forks two child processes. The main process decides which child to proxy each request to based on the request's query parameters, so different users can see different content.
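The routing decision itself can be isolated into a small pure function, which makes it easy to check on its own. This is only a sketch: pickChildPort, V1_PORT, and DEFAULT_PORT are hypothetical names that do not appear in the original code, and it uses the WHATWG URL class rather than url.parse.

```javascript
// Hypothetical constants mirroring the ports used in the example above.
const V1_PORT = 3000 // child1
const DEFAULT_PORT = 3001 // child2

// Hypothetical helper: maps a request URL to the port of the child that should serve it.
function pickChildPort(reqUrl) {
  // req.url holds only the path and query string, so a base is needed to parse it.
  const {searchParams} = new URL(reqUrl, 'http://localhost')
  return searchParams.get('version') === 'v1' ? V1_PORT : DEFAULT_PORT
}

console.log(pickChildPort('/?version=v1')) // 3000
console.log(pickChildPort('/')) // 3001
```

Inside the proxy, the handler would call pickChildPort(req.url) and forward the request to the returned port.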

The extra proxy layer costs some performance, however. One way to reduce that cost is to reuse TCP connections by passing the agent option to http.request (for more details, see Using HTTP Agent for keep-alive in Node.js). In our case, though, this optimization caused problems in the service.

Let's simulate the scenario. First, modify the code snippet above to enable TCP connection reuse and limit the agent to one connection per host:

// keepAlive reuses sockets; maxSockets: 1 allows one open socket per host
const agent = new http.Agent({keepAlive: true, maxSockets: 1})

const httpServer = http.createServer(function (req, res) {
  const query = url.parse(req.url, true).query

  if (query.version === 'v1') {
    http.get('http://localhost:3000', {agent}, (proxyRes) => {
      proxyRes.pipe(res)
    })
  } else {
    http.get('http://localhost:3001', {agent}, (proxyRes) => {
      proxyRes.pipe(res)
    })
  }
})

Next, we run a load test with autocannon -c 400 -d 100 http://localhost:8000, that is, 400 concurrent connections for 100 seconds.

The test results reveal the following:

  • During the load test, accessing http://localhost:8000 times out.
  • During the load test, the memory usage rapidly increases.
  • After the load test, accessing http://localhost:8000 still times out, but the memory usage gradually decreases. It takes a while before a response is received.
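One lightweight way to watch the memory trend described in the list above, without attaching a debugger, is to log process.memoryUsage() at a fixed interval from inside the proxy. This is a minimal sketch and not part of the original setup; memorySnapshot is a hypothetical helper name.

```javascript
// Hypothetical helper: reports this process's memory usage in whole megabytes.
function memorySnapshot() {
  const {rss, heapUsed} = process.memoryUsage()
  const toMB = (bytes) => Math.round(bytes / 1024 / 1024)
  return {rssMB: toMB(rss), heapUsedMB: toMB(heapUsed)}
}

// Log once per second; unref() keeps this timer from holding the process open.
const timer = setInterval(() => console.log(memorySnapshot()), 1000)
timer.unref()
```

Placed in index.js, this would print one line per second while the load test runs, which is enough to see whether usage is climbing or draining.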

We can think of TCP connections as railway tracks, with each HTTP message split into several train cars that travel along these tracks:

Image description

Because there is only one track between the proxy and the server, client requests that arrive faster than they can be served have to wait in a queue:

Image description

This explains why requests time out during the load test.
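The queueing effect can be reproduced in isolation. The following sketch starts a local server that delays every response, then sends the same batch of concurrent requests through an agent with maxSockets: 1 and through one that allows a socket per request. The names DELAY_MS, REQUESTS, get, and measure are illustrative, and the delay is shortened to 100 ms (instead of the article's 1 second) so the demo finishes quickly.

```javascript
const http = require('http')

const DELAY_MS = 100 // assumption: shorter than the 1000 ms used in child.js
const REQUESTS = 5

// Upstream that answers every request after DELAY_MS, like child.js.
const upstream = http.createServer((req, res) => {
  setTimeout(() => res.end('ok'), DELAY_MS)
})

// Issues one GET through the given agent and resolves once the body is consumed.
function get(port, agent) {
  return new Promise((resolve, reject) => {
    http
      .get({host: '127.0.0.1', port, agent}, (res) => {
        res.resume()
        res.on('end', resolve)
      })
      .on('error', reject)
  })
}

// Fires REQUESTS concurrent requests and resolves with the elapsed milliseconds.
async function measure(port, maxSockets) {
  const agent = new http.Agent({keepAlive: true, maxSockets})
  const start = Date.now()
  await Promise.all(Array.from({length: REQUESTS}, () => get(port, agent)))
  agent.destroy() // close keep-alive sockets so the process can exit
  return Date.now() - start
}

const demo = new Promise((resolve, reject) => {
  upstream.listen(0, '127.0.0.1', async () => {
    try {
      const {port} = upstream.address()
      const serial = await measure(port, 1) // one socket: requests wait in line
      const parallel = await measure(port, REQUESTS) // one socket per request
      console.log({serial, parallel})
      resolve({serial, parallel})
    } catch (err) {
      reject(err)
    } finally {
      upstream.close()
    }
  })
})
```

With a single socket the total time should be roughly REQUESTS × DELAY_MS, while the second run should take close to a single DELAY_MS. Exact figures vary by machine.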

Furthermore, the proxy creates a "request" object for every request waiting in the queue, which would explain why memory usage rises so quickly. We can verify this with Node.js's inspect feature.
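Before reaching for the inspector, the size of the backlog can also be read from the agent itself: agent.requests holds the requests that have not yet been assigned to a socket, grouped by connection name. The helper below is a sketch; pendingRequests is a hypothetical name, and the property should be treated as read-only.

```javascript
const http = require('http')

// Hypothetical helper: counts the requests an Agent has queued but not yet
// assigned to a socket. agent.requests maps a connection name to an array
// of pending requests.
function pendingRequests(agent) {
  return Object.values(agent.requests).reduce((n, queue) => n + queue.length, 0)
}

const agent = new http.Agent({keepAlive: true, maxSockets: 1})
console.log(pendingRequests(agent)) // 0, nothing has been queued yet
```

Logging pendingRequests(agent) periodically in the proxy during the load test would show the queue growing while requests arrive faster than the single connection can serve them.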

To use this feature, pass the --inspect flag when starting the Node.js process. For child processes started with fork, pass it through execArgv and give each child its own port, as shown below:

const child1 = cp.fork('child.js', [], {
  env: {PORT: 3000},
  execArgv: ['--inspect=9999'],
})
const child2 = cp.fork('child.js', [], {
  env: {PORT: 3001},
  execArgv: ['--inspect=9998'],
})

Then open Chrome DevTools, open the dedicated DevTools for Node.js, and add three connections (one for the master process and one for each child) to observe the following:

Image description

Image description

Image description

Here we focus only on the master process. Take a heap snapshot, start the load test, let it run for a while, then take a second snapshot. Comparing the two snapshots gives the following:

Image description

We can see that there are indeed many new ClientRequest instances between the two snapshots, confirming our previous speculation.

After the load test, no new requests reach the proxy, but the backlog remains. With a single connection and the artificial 1-second delay that child.js adds to each response, the backlog drains at only about one request per second. This explains why memory usage falls slowly and why a new request waits a long time for a response.
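A back-of-the-envelope estimate shows how slow this draining is. The sketch below assumes a backlog of 400 requests, chosen only because it matches the connection count of the load test. The real backlog was not measured here and may well be larger. estimateDrainSeconds is a hypothetical helper.

```javascript
// Hypothetical estimate: seconds needed to drain a backlog when each request
// occupies a socket for secondsPerRequest and at most maxSockets run at once.
function estimateDrainSeconds(backlog, secondsPerRequest, maxSockets) {
  return Math.ceil(backlog / maxSockets) * secondsPerRequest
}

console.log(estimateDrainSeconds(400, 1, 1)) // 400
console.log(estimateDrainSeconds(400, 1, 50)) // 8
```

Under these assumptions a single connection needs several minutes to clear the queue, whereas the same backlog spread over 50 connections would clear in a few seconds.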

"Don't optimize prematurely" is a valuable lesson in software development, and we learned it firsthand this time, especially since we were using an optimization technique that we only partially understood. The root cause was that we assumed we knew Node.js's Agent well enough, applied it without carefully analyzing its impact, and skipped detailed performance testing.
