Urban boyscout.Tech-whisperer.
545 stories

Incident report on memory leak caused by Cloudflare parser bug

1 Comment

Last Friday, Tavis Ormandy from Google’s Project Zero contacted Cloudflare to report a security problem with our edge servers. He was seeing corrupted web pages being returned by some HTTP requests run through Cloudflare.

It turned out that in some unusual circumstances, which I’ll detail below, our edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data. And some of that data had been cached by search engines.

For the avoidance of doubt, Cloudflare customer SSL private keys were not leaked. Cloudflare has always terminated SSL connections through an isolated instance of NGINX that was not affected by this bug.

We quickly identified the problem and turned off three minor Cloudflare features (email obfuscation, Server-side Excludes and Automatic HTTPS Rewrites) that were all using the same HTML parser chain that was causing the leakage. At that point it was no longer possible for memory to be returned in an HTTP response.

Because of the seriousness of such a bug, a cross-functional team from software engineering, infosec and operations formed in San Francisco and London to fully understand the underlying cause, to understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.

Having a global team meant that, at 12 hour intervals, work was handed over between offices enabling staff to work on the problem 24 hours a day. The team has worked continuously to ensure that this bug and its consequences are fully dealt with. One of the advantages of being a service is that bugs can go from reported to fixed in minutes to hours instead of months. The industry standard time allowed to deploy a fix for a bug like this is usually three months; we were completely finished globally in under 7 hours with an initial mitigation in 47 minutes.

The bug was serious because the leaked memory could contain private information and because it had been cached by search engines. We have also not discovered any evidence of malicious exploits of the bug or other reports of its existence.

The greatest period of impact was from February 13 and February 18 with around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulting in memory leakage (that’s about 0.00003% of requests).

We are grateful that it was found by one of the world’s top security research teams and reported to us.

This blog post is rather long but, as is our tradition, we prefer to be open and technically detailed about problems that occur with our service.

Parsing and modifying HTML on the fly

Many of Cloudflare’s services rely on parsing and modifying HTML pages as they pass through our edge servers. For example, we can insert the Google Analytics tag, safely rewrite http:// links to https://, exclude parts of a page from bad bots, obfuscate email addresses, enable AMP, and more by modifying the HTML of a page.

To modify the page, we need to read and parse the HTML to find elements that need changing. Since the very early days of Cloudflare, we’ve used a parser written using Ragel. A single .rl file contains an HTML parser used for all the on-the-fly HTML modifications that Cloudflare performs.

About a year ago we decided that the Ragel parser had become too complex to maintain and we started to write a new parser, named cf-html, to replace it. This streaming parser works correctly with HTML5 and is much, much faster and easier to maintain.

We first used this new parser for the Automatic HTTP Rewrites feature and have been slowly migrating functionality that uses the old Ragel parser to cf-html.

Both cf-html and the old Ragel parser are implemented as NGINX modules compiled into our NGINX builds. These NGINX filter modules parse buffers (blocks of memory) containing HTML responses, make modifications as necessary, and pass the buffers onto the next filter.

It turned out that the underlying bug that caused the memory leak had been present in our Ragel-based parser for many years but no memory was leaked because of the way the internal NGINX buffers were used. Introducing cf-html subtly changed the buffering which enabled the leakage even though there were no problems in cf-html itself.

Once we knew that the bug was being caused by the activation of cf-html (but before we knew why) we disabled the three features that caused it to be used. Every feature Cloudflare ships has a corresponding feature flag, which we call a ‘global kill’. We activated the Email Obfuscation global kill 47 minutes after receiving details of the problem and the Automatic HTTPS Rewrites global kill 3h05m later. The Email Obfuscation feature had been changed on February 13 and was the primary cause of the leaked memory, thus disabling it quickly stopped almost all memory leaks.

Within a few seconds, those features were disabled worldwide. We confirmed we were not seeing memory leakage via test URIs and had Google double check that they saw the same thing.

We then discovered that a third feature, Server-Side Excludes, was also vulnerable and did not have a global kill switch (it was so old it preceded the implementation of global kills). We implemented a global kill for Server-Side Excludes and deployed a patch to our fleet worldwide. From realizing Server-Side Excludes were a problem to deploying a patch took roughly three hours. However, Server-Side Excludes are rarely used and only activated for malicious IP addresses.

Root cause of the bug

The Ragel code is converted into generated C code which is then compiled. The C code uses, in the classic C manner, pointers to the HTML document being parsed, and Ragel itself gives the user a lot of control of the movement of those pointers. The underlying bug occurs because of a pointer error.

/* generated code */
if ( ++p == pe )
    goto _test_eof;

The root cause of the bug was that reaching the end of a buffer was checked using the equality operator and a pointer was able to step past the end of the buffer. This is known as a buffer overrun. Had the check been done using >= instead of == jumping over the buffer end would have been caught. The equality check is generated automatically by Ragel and was not part of the code that we wrote. This indicated that we were not using Ragel correctly.

The Ragel code we wrote contained a bug that caused the pointer to jump over the end of the buffer and past the ability of an equality check to spot the buffer overrun.

Here’s a piece of Ragel code used to consume an attribute in an HTML <script> tag. The first line says that it should attempt to find zero of more unquoted_attr_char followed by (that’s the :>> concatenation operator) whitespace, forward slash or then > signifying the end of the tag.

script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
>{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };

If an attribute is well-formed, then the Ragel parser moves to the code inside the @{ } block. If the attribute fails to parse (which is the start of the bug we are discussing today) then the $lerr{ } block is used.

For example, in certain circumstances (detailed below) if the web page ended with a broken HTML tag like this:

<script type=

the $lerr{ } block would get used and the buffer would be overrun. In this case the $lerr does dd(“script consume_attr failed”); (that’s a debug logging statement that is a nop in production) and then does fgoto script_consume_attr; (the state transitions to script_consume_attr to parse the next attribute).
From our statistics it appears that such broken tags at the end of the HTML occur on about 0.06% of websites.

If you have a keen eye you may have noticed that the @{ } transition also did a fgoto but right before it did fhold and the $lerr{ } block did not. It’s the missing fhold that resulted in the memory leakage.

Internally, the generated C code has a pointer named p that is pointing to the character being examined in the HTML document. fhold is equivalent to p-- and is essential because when the error condition occurs p will be pointing to the character that caused the script_consume_attr to fail.

And it’s doubly important because if this error condition occurs at the end of the buffer containing the HTML document then p will be after the end of the document (p will be pe + 1 internally) and a subsequent check that the end of the buffer has been reached will fail and p will run outside the buffer.

Adding an fhold to the error handler fixes the problem.

Why now

That explains how the pointer could run past the end of the buffer, but not why the problem suddenly manifested itself. After all, this code had been in production and stable for years.

Returning to the script_consume_attr definition above:

script_consume_attr := ((unquoted_attr_char)* :>> (space|'/'|'>'))
>{ ddctx("script consume_attr"); }
@{ fhold; fgoto script_tag_parse; }
$lerr{ dd("script consume_attr failed");
       fgoto script_consume_attr; };

What happens when the parser runs out of characters to parse while consuming an attribute differs whether the buffer currently being parsed is the last buffer or not. If it’s not the last buffer, then there’s no need to use $lerr as the parser doesn’t know whether an error has occurred or not as the rest of the attribute may be in the next buffer.

But if this is the last buffer, then the $lerr is executed. Here’s how the code ends up skipping over the end-of-file and running through memory.

The entry point to the parsing function is ngx_http_email_parse_email (the name is historical, it does much more than email parsing).

ngx_int_t ngx_http_email_parse_email(ngx_http_request_t *r, ngx_http_email_ctx_t *ctx) {
    u_char  *p = ctx->pos;
    u_char  *pe = ctx->buf->last;
    u_char  *eof = ctx->buf->last_buf ? pe : NULL;

You can see that p points to the first character in the buffer, pe to the character after the end of the buffer and eof is set to pe if this is the last buffer in the chain (indicated by the last_buf boolean), otherwise it is NULL.

When the old and new parsers are both present during request handling a buffer such as this will be passed to the function above:

(gdb) p *in->buf
$8 = {
  pos = 0x558a2f58be30 "<script type=\"",
  last = 0x558a2f58be3e "",


  last_buf = 1,


Here there is data and last_buf is 1. When the new parser is not present the final buffer that contains data looks like this:

(gdb) p *in->buf
$6 = {
  pos = 0x558a238e94f7 "<script type=\"",
  last = 0x558a238e9504 "",


  last_buf = 0,


A final empty buffer (pos and last both NULL and last_buf = 1) will follow that buffer but ngx_http_email_parse_email is not invoked if the buffer is empty.

So, in the case where only the old parser is present, the final buffer that contains data has last_buf set to 0. That means that eof will be NULL. Now when trying to handle script_consume_attr with an unfinished tag at the end of the buffer the $lerr will not be executed because the parser believes (because of last_buf) that there may be more data coming.

The situation is different when both parsers are present. last_buf is 1, eof is set to pe and the $lerr code runs. Here’s the generated code for it:

/* #line 877 "ngx_http_email_filter_parser.rl" */
{ dd("script consume_attr failed");
              {goto st1266;} }
     goto st0;


    if ( ++p == pe )
        goto _test_eof1266;

The parser runs out of characters while trying to perform script_consume_attr and p will be pe when that happens. Because there’s no fhold (that would have done p--) when the code jumps to st1266 p is incremented and is now past pe.

It then won’t jump to _test_eof1266 (where EOF checking would have been performed) and will carry on past the end of the buffer trying to parse the HTML document.

So, the bug had been dormant for years until the internal feng shui of the buffers passed between NGINX filter modules changed with the introduction of cf-html.

Going bug hunting

Research by IBM in the 1960s and 1970s showed that bugs tend to cluster in what became known as “error-prone modules”. Since we’d identified a nasty pointer overrun in the code generated by Ragel it was prudent to go hunting for other bugs.

Part of the infosec team started fuzzing the generated code to look for other possible pointer overruns. Another team built test cases from malformed web pages found in the wild. A software engineering team began a manual inspection of the generated code looking for problems.

At that point it was decided to add explicit pointer checks to every pointer access in the generated code to prevent any future problem and to log any errors seen in the wild. The errors generated were fed to our global error logging infrastructure for analysis and trending.

#define SAFE_CHAR ({\
    if (!__builtin_expect(p < pe, 1)) {\
        ngx_log_error(NGX_LOG_CRIT, r->connection->log, 0, "email filter tried to access char past EOF");\
        output_flat_saved(r, ctx);\
        return NGX_ERROR;\

And we began seeing log lines like this:

2017/02/19 13:47:34 [crit] 27558#0: *2 email filter tried to access char past EOF while sending response to client, client:, server: localhost, request: "GET /malformed-test.html HTTP/1.1”

Every log line indicates an HTTP request that could have leaked private memory. By logging how often the problem was occurring we hoped to get an estimate of the number of times HTTP request had leaked memory while the bug was present.

In order for the memory to leak the following had to be true:

The final buffer containing data had to finish with a malformed script or img tag
The buffer had to be less than 4k in length (otherwise NGINX would crash)
The customer had to either have Email Obfuscation enabled (because it uses both the old and new parsers as we transition),
… or Automatic HTTPS Rewrites/Server Side Excludes (which use the new parser) in combination with another Cloudflare feature that uses the old parser. … and Server-Side Excludes only execute if the client IP has a poor reputation (i.e. it does not work for most visitors).

That explains why the buffer overrun resulting in a leak of memory occurred so infrequently.

Additionally, the Email Obfuscation feature (which uses both parsers and would have enabled the bug to happen on the most Cloudflare sites) was only enabled on February 13 (four days before Tavis’ report).

The three features implicated were rolled out as follows. The earliest date memory could have leaked is 2016-09-22.

2016-09-22 Automatic HTTP Rewrites enabled
2017-01-30 Server-Side Excludes migrated to new parser
2017-02-13 Email Obfuscation partially migrated to new parser
2017-02-18 Google reports problem to Cloudflare and leak is stopped

The greatest potential impact occurred for four days starting on February 13 because Automatic HTTP Rewrites wasn’t widely used and Server-Side Excludes only activate for malicious IP addresses.

Internal impact of the bug

Cloudflare runs multiple separate processes on the edge machines and these provide process and memory isolation. The memory being leaked was from a process based on NGINX that does HTTP handling. It has a separate heap from processes doing SSL, image re-compression, and caching, which meant that we were quickly able to determine that SSL private keys belonging to our customers could not have been leaked.

However, the memory space being leaked did still contain sensitive information. One obvious piece of information that had leaked was a private key used to secure connections between Cloudflare machines.

When processing HTTP requests for customers’ web sites our edge machines talk to each other within a rack, within a data center, and between data centers for logging, caching, and to retrieve web pages from origin web servers.

In response to heightened concerns about surveillance activities against Internet companies, we decided in 2013 to encrypt all connections between Cloudflare machines to prevent such an attack even if the machines were sitting in the same rack.

The private key leaked was the one used for this machine to machine encryption. There were also a small number of secrets used internally at Cloudflare for authentication present.

External impact and cache clearing

More concerning was that fact that chunks of in-flight HTTP requests for Cloudflare customers were present in the dumped memory. That meant that information that should have been private could be disclosed.

This included HTTP headers, chunks of POST data (perhaps containing passwords), JSON for API calls, URI parameters, cookies and other sensitive information used for authentication (such as API keys and OAuth tokens).

Because Cloudflare operates a large, shared infrastructure an HTTP request to a Cloudflare web site that was vulnerable to this problem could reveal information about an unrelated other Cloudflare site.

An additional problem was that Google (and other search engines) had cached some of the leaked memory through their normal crawling and caching processes. We wanted to ensure that this memory was scrubbed from search engine caches before the public disclosure of the problem so that third-parties would not be able to go hunting for sensitive information.

Our natural inclination was to get news of the bug out as quickly as possible, but we felt we had a duty of care to ensure that search engine caches were scrubbed before a public announcement.

The infosec team worked to identify URIs in search engine caches that had leaked memory and get them purged. With the help of Google, Yahoo, Bing and others, we found 770 unique URIs that had been cached and which contained leaked memory. Those 770 unique URIs covered 161 unique domains. The leaked memory has been purged with the help of the search engines.

We also undertook other search expeditions looking for potentially leaked information on sites like Pastebin and did not find anything.

Some lessons

The engineers working on the new HTML parser had been so worried about bugs affecting our service that they had spent hours verifying that it did not contain security problems.

Unfortunately, it was the ancient piece of software that contained a latent security problem and that problem only showed up as we were in the process of migrating away from it. Our internal infosec team is now undertaking a project to fuzz older software looking for potential other security problems.

Detailed Timeline

We are very grateful to our colleagues at Google for contacting us about the problem and working closely with us through its resolution. All of which occurred without any reports that outside parties had identified the issue or exploited it.

All times are UTC.

2017-02-18 0011 Tweet from Tavis Ormandy asking for Cloudflare contact information
2017-02-18 0032 Cloudflare receives details of bug from Google
2017-02-18 0040 Cross functional team assembles in San Francisco
2017-02-18 0119 Email Obfuscation disabled worldwide
2017-02-18 0122 London team joins
2017-02-18 0424 Automatic HTTPS Rewrites disabled worldwide
2017-02-18 0722 Patch implementing kill switch for cf-html parser deployed worldwide

2017-02-20 2159 SAFE_CHAR fix deployed globally

2017-02-21 1803 Automatic HTTPS Rewrites, Server-Side Excludes and Email Obfuscation re-enabled worldwide

NOTE: This post was updated to reflect updated information.

Read the whole story
23 hours ago
that's quite a mea culpa.
San Francisco, CA
Share this story

[Premiere] Worthy – A Little Weird

1 Comment

New UK based label Strangelove heads out of the gate at breakneck speed March 3rd with their smashing inaugural release “A Little Weird EP” which comes from prolific SF-based producer, Anabatic Records founder, Dirtybird affiliate and Discobelle fam Worthy.

The title track is offered right here as a Discobelle exclusive premiere and it’s an hypnotic, acid bleeped out, bass filled techy monster that will leave clubs in shambles.

Read the whole story
1 day ago
nice one @w_o_r_t_h_y
San Francisco, CA
Share this story

The consequences of refusing to unlock your phone at the U.S. border ↦


by Dan Moren

Over at Ars Technica, our former Macworld colleague Cyrus Farivar has put together a look at exactly what might happen if you’re asked to unlock your phone at the U.S. border and you refuse:

He concluded: “If I was asked to unlock my phone or computer by border officials today, I would politely say no, ask for an attorney, and deal with the consequences from there.”

However, in 2015, a federal judge in the District of Columbia ruled in favor of a South Korean businessman who has his laptop seized at Los Angeles International Airport, and searched without a warrant.

The ACLU’s Nathan Freed Wessler, who noted he has personally been sent to secondary screening but has never been asked about his own electronic devices at the border, added that this puts travelers in a “tough spot” between balancing their privacy rights and their ability to get where they are going.

The fact that we have to even think about this issue is unsettling, but we live in unsettling times. Electronic devices carry a lot of our personal information, which is exactly why the government wants to look at them, but it doesn’t mean that they should be given carte blanche to dig through our personal correspondence, private pictures, and so on.

[Read on Six Colors.]

Read the whole story
1 day ago
I'd still like to see duress codes, because there's always the chance you'll be caught off guard. And I've seen folks say to not bring a smart phone at all while traveling, which seems a bit extreme, but I suppose if you want to minimize delays, that may be your best bet.
San Francisco, CA
7 days ago
Here is a place where Siri could be useful. Have a special word (user-defined in voice training) trigger Siri to put the phone into lockdown mode. This would disable Touch ID just in case unscrupulous detainers wanted to cuff someone and then push the detainee's phone to the finger behind their back. Before traveling I always disable Touch ID, and since I've got a 10 digit passcode the phone is a no-go for any brute force attacks. I've even had TSA folks comment on my passcode and my response is the same every time "you made me do this, I certainly don't want a 10 digit code."
Space City, USA
7 days ago
you could also just turn off your phone. touchid is disabled on boot. you'd probably want to do it while deplaning so it can't be construed as obstruction.
1 day ago
hat tip to chrisrosa for the KISS method requiring no further action on Apple's part.
Share this story

No Small Parts: James Hong

1 Share

No Small Parts: James Hong


Brandon Hardesty presents what could be his proudest moment – an exploration of the phenomenally prolific, multitalented actor James Hong, who has played over 400 roles – from dramas, to sitcoms, to video games, to animation, and at age 88 is still going strong.

Read the whole story
3 days ago
San Francisco, CA
Share this story

Listen to Jidenna's Debut Studio Album 'The Chief'

1 Comment
Jidenna has released his first studio album 'The Chief,' including features by Quavo and and Janelle Monáe.

Read the whole story
7 days ago
Luke Cage Massive!
San Francisco, CA
Share this story

50 States of [McMansion] Hell: Redding, CA

1 Comment and 2 Shares

Hello everyone. I am late, but I hope to make up with what may be the greatest thing I’ve ever seen in my life. It’s so great it’s going to give me nightmares for the next three weeks. I had a dream once in the 5th grade that I went swimming with Lemony Snicket in a black pond that led to a furnace instead of a waterfall. That is perhaps the only scary dream I can remember verbatim, until now

Before I show you this house, I need to introduce with what the Realtor®™ has to say:


When a Realtor™® writes a description with not a single word in all caps, you know something is very, very wrong. 

I want to put a disclaimer here: of all the houses I’ve ever done, this is the only one that’s had less than 10 days on Zillow. Since I am more terrified of realtors than Aunt Josephine, I would like to take the time to articulate my copyright disclaimer before anyone dare pull the trigger on this bad-boy. 

Copyright Disclaimer: All photographs in this post are from real estate aggregate Zillow.com and are used in this post for the purposes of education, satire, and parody, consistent with 17 USC §107.

Now back to business. 

Our charmer is a 5 bed, 4 bath stunner built in 1988. 


wait we need to go deeper 


We are literally looking at a trash house. The front door is a sickly coo-coo clock articulated in dung. Several of the windows are crooked. I don’t know if that can properly described as a frieze along the roofline, as I’m not an expert in trash classicism. 

The Lawyer Foyer


Do you know how hard it is to refrain from just talking about adult entertainment with this house? This house has its own obscenity laws. 

The Living Thing


Pretty sure the floor is actually lava in that next room. 

Dining Room?


Yeah, this is about the level of discomfort I felt going to my first college party. Probably stickier though. 

The Kitchen


Okay, I have lived in some grody apartments over the years, but nothing has had as much grode as this house costing around half a million dollars. I’m having some crippling self-doubt about my upward mobility these days. 

Bathroom 1


Maybe it was summer camp when they talked about Athlete’s foot. Puberty has permanently blocked out my memory. 

Bedroom 1


Am I too old to use smh? 

Bathroom 2


What’s unsettling about the toilet? I can’t tell what that stain is under the lid nor can I assume what kind of people would purchase such a device. 

Bedroom 2


God, you know someone died in there. Also, is it just me or could this image replaced with some other text totally be the cover of a really bad shoegaze album c.1996? 

Bathroom 4



Bathroom 5


Things men could put in those drawers:
- half-used pomade tins
- assorted toothbrushes
- needlessly gendered jokes about men by women 





***gets to crime scene***

“INSTANT crime test is positive for roofies. Who would do such a thing to such a promising young girl?????stares at camera somberly
“I don’t know sensitive but strong female agent but we’ll get to the bottom of it.” PULLS OUT HYPER-IPAD. 
you know what? this joke is over, I’m done. 

Finally, the rear elevation:


the poetic justice is oozing from the walls. 

Well, that does it for California. I know I said I was going to do a double feature. I know I said it with my own keyboard. But here is the truth, friends: I got off the bus back from North Carolina at 5PM. I am exhausted, and honestly, a little :(. I’m excited to get to STRAYA tomorrow, despite the :(. I want to move there. 


If you like this post, and want to see more like it (plus get sweet access to things like stickers and behind the scenes stuff), consider supporting me on Patreon! Not into recurring donations? Check out the McMansion Hell Store - 30% goes to charity.

Copyright Disclaimer: All photographs in this post are from real estate aggregate Zillow.com and are used in this post for the purposes of education, satire, and parody, consistent with 17 USC §107.

Read the whole story
12 days ago
San Francisco, CA
Share this story
Next Page of Stories