Don't forget about Data Integrity when you think about Security

Matthew Gregory: Welcome to the podcast. Every week Mrinal, Glenn, and I get together to discuss technology, building products, secure by design, how Ockam works, and a lot more. We comment on industry macros across cloud, open source, and the security space. That brings us to what we're going to talk about today, which is that people often seem to have a qualification around the importance of their data and what they should be doing with it. They usually qualify the privacy and security of their data, and then build a posture or governance model around that. They talk about different technologies, whether something is encrypted or needs to be encrypted, or how they're encrypting it. I want to unpack that. Let's start with the first one, which is how people qualify the importance of their data, what they need to do to secure it, and whether they're encrypting it or not. A common theme is that people think their data is far less important than it is, and undervalue what happens when that data is used in their applications, because those applications rely on that data to be truthful and to have integrity. With that, I'll kick it over to Glenn. “My data is not important,” why does this not make any sense?

Your Data is More Important than you Think

Glenn Gillen: My initial reaction is to ask, why are you collecting it then? If the data's not important, just stop doing that. Stop wasting cycles. I think people mean something else when they say that; the word important is probably the wrong one to use. The data might not be commercially sensitive, for example. It could be aggregated metrics, or public data that anyone could capture. It's not specific to your business, or it's already public. If someone saw that data, you wouldn’t care. Often that is what people mean: they don't care if it's private, rather than the data not being important.

Mrinal Wadhwa: An interesting nuance is that there is the importance axis, and also what is considered data. Oftentimes people will think about what they are collecting as data, but they won’t think about the request to collect that data as data itself. The importance is in the fact that I'm observing some piece of information, collecting it, and then delivering it to a data store. However, the message that carries your data from the place you collected it to the place you store it is also data that is relevant. And that message is relevant to why you are collecting the information. Maybe you're building an AI model, where the data must always be correct when it's fed into that model. It's important because it's critical to whatever your business use case is that relies on that model.

Glenn Gillen: If you asked someone, “What if you replaced your data with a random number generator?” their reaction would be no, the information must be correct. Well, then your data is important. I think that's the best litmus test for what important means. If your data was wrong, would you care? The answer's always yes.

Matthew Gregory: You're building an application that is going to consume this data and make decisions based on it to produce some output. It's the process you have built that's going to consume and transform the data. That is the important part, even if the data is public. For example, we spoke to a wind farm operator who claimed their data was not important. It’s wind speed data, which is public. But they're making important business decisions based on that data, such as turning the wind farm on and off, and forecasting how much energy is being produced. The applications in the data center need accurate information. So, ergo, the data is important, and the integrity of that data is critical because your application depends on it. Another thing we hear a lot about is the importance of data privacy. I think this actually goes in two directions: people either over-index on it or don't index on it at all. I think both are interesting to talk about. Privacy of data is a big topic right now, just in the press, with the new SEC rule where you have to disclose a data breach if you're a publicly traded company. Mrinal, if someone says, “My data doesn't need to be kept private,” what would be the other concerns when we're talking about privacy of data? Even if we don't care that it leaked outside of our data center, what else should you be thinking about?

Security is (a lot) more than just Privacy

Mrinal Wadhwa: Let's take the case of “my data doesn’t need to be kept secret.” It's okay if everybody knows that I collected some information. What's usually not okay is that the information I collected is incorrect. If that is important to you, then the tools that give your data the property of data integrity are important to you. So it might be okay that the data is revealed to someone, but you probably are not okay with the data being tampered with, or the data not coming from the right source. So even when you don't care about the privacy, the secrecy, or the confidentiality of a piece of information, you still care about the integrity and authenticity of that information.

Matthew Gregory: Let's break this down to a specific example. There's some application living on the internet that's producing data, originating it, and it sends the data across the internet to a database that's going to store it before it eventually makes its way into an application where it will be processed. In this data-producer-to-data-store scenario, we may not care if a malicious actor can see the data. But we do care about the integrity, authenticity, and origin of that data. Why is that?

What can happen without Data Integrity

Mrinal Wadhwa: Let's say the message to the database is, “Write to the readings table that the observation is 500.” That's the message that's going over the wire, and we've already decided that the reading of 500 is not private information. But we still care that when that message reaches the data store, the value of 500 is in fact written to the readings table. Those two pieces of information, that it's being written to the right table and that the value is 500, must remain the same from when the message was sent to when it was received. We also care that only the authenticated data source can write to the readings table. We don’t want an attacker in the middle to be able to generate random messages that get stored in our database. It'll bump up our database bill, and it will create garbage in our database that will affect our AI that learns from this database. If that happens, garbage gets fed into our system, or we end up spending a lot of time and money cleaning up the data. Both of those would be bad, and it doesn't end there. The data store will acknowledge that it received the data, and an attacker in the middle could block that acknowledgment or incorrectly send it (i.e. send an acknowledgment when an action didn't happen). That would be bad too. If I can block the acknowledgment, the source keeps sending the same data over and over again. This has the same effect: many readings get created instead of one because I was able to block the acknowledgment. This results in more data, or garbage, getting stored in the database. An attacker can also send incorrect acknowledgments. Even though the source said, “store the reading 500,” that reading never gets stored because the attacker in the middle sends a fake acknowledgment. What this means is that even though we didn't care about the privacy of this flow of information, we still care about making sure only the correct source can write to the data store. We care that only the data store can acknowledge the fact that the data has been successfully written. So we care about the authenticity and the integrity of these messages, because we don't want incorrect readings to be stored, or readings stored in the wrong table.
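The tampering scenario above can be sketched with a message authentication code: the sender attaches a tag computed over the message with a shared key, and the data store rejects anything whose tag doesn't verify. This is a minimal, illustrative stdlib sketch of the integrity/authenticity property; the key establishment, rotation, and replay protection discussed later are out of scope here, and the message shape is hypothetical.

```python
import hashlib
import hmac
import json
import secrets

# Shared secret between the authenticated data source and the data store.
# (Illustrative only: in a real system this key would be established and
# rotated by a secure channel protocol, not generated ad hoc.)
KEY = secrets.token_bytes(32)

def sign(message: dict, key: bytes) -> bytes:
    """Produce an integrity/authenticity tag over the serialized message."""
    payload = json.dumps(message, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(message: dict, tag: bytes, key: bytes) -> bool:
    """Constant-time check that the message was not tampered with in transit."""
    return hmac.compare_digest(sign(message, key), tag)

# The sender's message: "write the value 500 to the readings table".
message = {"table": "readings", "value": 500}
tag = sign(message, KEY)

assert verify(message, tag, KEY)                # genuine message is accepted

tampered = {"table": "readings", "value": 900}  # attacker changes the value
assert not verify(tampered, tag, KEY)           # tampering detected, write rejected
```

An attacker without the key can neither alter the value nor forge new writes, which is exactly the property the unprotected message lacks.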

Matthew Gregory: That's right. And to unpack that even further, if the message is in clear text when it goes over the wire, an attacker in the middle knows how to write to the database. They could perform man-in-the-middle attacks, because they know what's happening between the application and the data store: where the data came from, where it's going, what the message is, and how to write to it. It would be even worse if you have an unprotected credential in there as well. The punchline here is that privacy is not the only security concern, which brings us to unpacking ‘security’ as a word, and how people go about setting up security, thinking about data governance, and control of data. Glenn, could you talk about the spectrum of how people think about secure systems in this distributed computing world, compared to the simpler days when our applications were all in the same box?

Glenn Gillen: I think that, as an industry, we don't care about privacy enough. The SEC ruling is forcing companies to think about it. But that said, when we start to think about security, we myopically think about privacy more often than not. There's a juxtaposition of, we're not thinking about privacy enough, but when we do think about it, we think about it too much at the expense of integrity and authenticity. Matt Johansson gave an example of how people think of security like visiting an emergency room. Something happens, you triage it, and you're done and you move on. What we witness time and time again is, that's not the case. Attackers get in and they lay low for a long time. These little things escalate over time. They'll sit there and they'll do some reconnaissance on your systems. His point was that security is more like a mental health condition. It's something that needs constant nurturing, and essentially therapy. You need to maintain it and keep yourself in a healthy state. It's not something you can just triage. People tend to think about it from a privacy perspective, at the expense of all the other things. Someone getting into your system and exposing your data is embarrassing and that's why it's human for us to focus on privacy. You'll get called out on it publicly. You impact a lot of people. The idea of someone getting in and quietly polluting your data for an extended period of time is quite terrifying to me these days. As an industry, we're a little bit asleep at the wheel in helping people understand those risks, especially if you're feeding data into an AI model. You're training a model on all of the data that you've been collecting for years and making business decisions based on it. If you can't absolutely trust all of the inputs into that system, the outputs will be garbage. We talked about the wind farm example before. That's a triage-type problem. 
If you get a wind speed rating of 900 knots and you've mistakenly shut down a turbine because it can't operate at that speed, you know immediately that a mistake has been made. If an attacker was instead quietly polluting the data with random variance, that data would be gone. You can't go back and fix it, you don’t know when the problem emerged, and you can't clean it. Your data is forever polluted. Your business intelligence is ruined. People think about security as just privacy, and they've conflated it for a long time because of the emotional attachment to privacy. There's a much bigger business risk looming, in my opinion, around integrity, authenticity, and control of the data.

Mrinal Wadhwa: Think about the amount of investment that goes into training an AI model: all of the data, and the GPUs. If that data has been polluted, that can't be unrolled anymore. It's part of the model you trained, a model you spent years of compute and storage on. All of that becomes a wasted investment, or worse, the polluted results impact the actions you take based on the model. There are also immediate strategic impacts. If the messages are being actively tampered with and there's an attacker somewhere in your environment, they can selectively tamper with messages that are supposed to trigger various actions, such as deleting a database or spinning up a bunch of instances; all of those commands are pieces of data too. If you don’t have guarantees around authenticity and integrity, those messages can cause unwanted actions.

Why you need encryption (even when you don’t think you do)

Matthew Gregory: Let's unpack this, Mrinal. When someone says, “I don't need encryption.” We established that they should care about integrity and authenticity, so why do they also need encryption?

Mrinal Wadhwa: I think “I don't need to encrypt my data” usually comes from a place of, “I don't think my data needs to be private or secret,” and it's never because people don’t think their data needs to be correct. So people care about the correctness of this information. If you care about the integrity of the data, well it turns out that the mechanisms to have privacy, integrity, and authenticity guarantees work together. In Ockam’s case, Ockam secure channels provide those guarantees together. It’s more expensive and often less secure to decouple these properties, so they come as a package. You get data integrity, authenticity, and confidentiality as a package. When people are saying they don't want encryption, they are saying that it is okay if the data isn't kept secret or private, but they're not saying my data should not have data integrity and authenticity. Since they care about that, they need encryption because that’s the way you get a guarantee of integrity and authenticity.
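The "package" idea can be made concrete with an encrypt-then-MAC sketch: decryption refuses to proceed unless the authenticity check passes first, so confidentiality, integrity, and authenticity arrive together. This is a toy construction for illustration only, not production cryptography; real secure channels use vetted AEAD ciphers such as AES-GCM or ChaCha20-Poly1305, and would derive separate encryption and MAC keys.

```python
import hashlib
import hmac
import secrets

# Toy encrypt-then-MAC. NOT production crypto: the hash-based keystream and
# the shared key for encryption and MAC are simplifications for illustration.

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Derive a per-message keystream from the key and a fresh nonce."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def seal(key: bytes, plaintext: bytes):
    """Encrypt, then tag nonce + ciphertext so tampering is detectable."""
    nonce = secrets.token_bytes(16)
    ct = bytes(p ^ k for p, k in zip(plaintext, keystream(key, nonce, len(plaintext))))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce, ct, tag

def open_sealed(key: bytes, nonce: bytes, ct: bytes, tag: bytes) -> bytes:
    """Verify authenticity and integrity BEFORE decrypting anything."""
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        raise ValueError("message rejected: failed authentication")
    return bytes(c ^ k for c, k in zip(ct, keystream(key, nonce, len(ct))))

key = secrets.token_bytes(32)
nonce, ct, tag = seal(key, b"reading=500")
assert open_sealed(key, nonce, ct, tag) == b"reading=500"

# Flipping a single ciphertext bit is caught by the tag, never silently decrypted.
bad = bytes([ct[0] ^ 1]) + ct[1:]
try:
    open_sealed(key, nonce, bad, tag)
except ValueError:
    pass  # tampering detected
```

The point of the structure is that you can't opt out of the integrity check and keep only the secrecy: they are one operation.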

Glenn Gillen: The flip side of that is someone whose data is already encrypted. From my experience as a web developer, often you’ll hear, “I'm using TLS, I've got TLS to the API.” The question is, what do you actually have there? You have a guarantee around privacy: only the server can read that information. That's what most TLS setups give you. But for a lot of apps, TLS terminates at the CDN, because you don't want every request to travel all the way to the origin. You end up with a privacy guarantee for the length of that TLS connection to the CDN, and you often ignore everything that happens behind the scenes, trusting that your providers and intermediaries are not looking at the data and have put controls in place. And that doesn’t answer the question of integrity. Because all you have is privacy, and then we're back to Mrinal’s point: you care that it's correct and you want integrity guarantees. Then you need to verify who is sending the data. You could do mutual TLS, but often people are not doing that for all of their clients all of the time. And if you are, now you have to manage keys. So very quickly you're in a place where you think you have privacy, but it's a really small definition of privacy, and you don’t have any guarantees around integrity.

Mrinal Wadhwa: Security is often defined from the perspective of whoever is controlling a certain system or is responsible for a certain asset. Let's say I'm the person responsible for making sure the data stored in that AI database is always correct. If that's the responsibility, I need to have control over who gets to store data in my database, and what they get to store. You can only have that control if you have the ability to know who sent a piece of information, and whether what arrived is exactly what they sent. Was it tampered with along the path? To have that control, you need control over the behavior of your system, what information is stored in it, et cetera.

You need authenticity of who is sending requests and who you are sending responses to. You need an integrity guarantee on what the requests and responses are. Since end-to-end encryption brings these properties together, I also get privacy guarantees. In some cases, you may care about privacy, in others you might not. But if you need your system to be secure, you need a tool that has these properties and they tend to come as a package.

Encryption is not Security

Matthew Gregory: It’s often a trap to think you are secure because your data is encrypted. The question is, encrypted where? Is it data at rest in your database, or while it's moving? Are you talking about a single TLS connection? Glenn, could you talk about this false narrative that having encryption means you are secure and that your data is private?

Glenn Gillen: TLS is an easy example to pick on because it feels like people have been given a checklist of stuff to go through. You have encryption in transit, encryption at rest, job done. Right? Let’s say I have a TLS connection from my desktop here to a PoP in Melbourne. What happens behind the scenes? It’s the shared responsibility model: you speak to a vendor, look at their SOC 2 and compliance reports, and make sure it’s been audited. If anything in that supply chain gets compromised, you don't know what has been exposed and you don’t have control. When you outsource control to a cloud vendor, ultimately you are still responsible for security. You can have shared responsibility for it, but I think the easiest way to solve this is to focus on the two ends of the system that you can control. If you can control both of those ends of the system, you can build solutions that are secure all the way through, no matter what that supply chain looks like, no matter what the network or the topology is. You can build highly trustworthy systems in that environment. The same thing goes for a managed service or message queue. Most of them are designed, intentionally, to have the data available in plain text inside that system. It's not just a pipe; they're trying to provide analytics and build tooling around that. It’s part of the business model in a lot of those cases to have the data be plain text and visible in the system. That’s not what you want. You want integrity, control, and privacy all the way through your system. So you can tick a box and say, “We've got TLS.” But then you ignore the fact that it's plain text during this important, high-value moment, and then it's encrypted again on the way back out. That’s not private.

The Key to getting Encryption Right

Matthew Gregory: When we're talking to people about their architecture and encryption, the first thing we jump to is the keys. How and where were they created? How were they distributed? If you're generating symmetric keys for your encryption and sending them out into the world or building them into software, now you have a vulnerability around how the key was created and how it was distributed.

Mrinal Wadhwa: If someone's building a system where all the participants have one key that stays the same forever, and you call an AES function to encrypt using that one key, you effectively have no encryption. It might feel like it, but it's not encryption. Calling AES is not encryption. Encryption needs to be end-to-end and needs to have a series of properties that have been studied for about 30 years now.
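The point that "calling AES is not encryption" can be made concrete with a toy sketch. Assume a stream-style cipher used with one static key and no per-message nonce, so every message is XORed with the same keystream (the same failure mode as AES-CTR with a reused nonce): an eavesdropper can cancel the keystream out entirely, without ever learning the key.

```python
import hashlib

# One static key, no per-message nonce: every message gets XORed with the
# SAME keystream. Toy illustration of why fixed-key "encryption" fails.
key = b"one-static-key-used-forever!"
stream = hashlib.sha256(key).digest()

def bad_encrypt(plaintext: bytes) -> bytes:
    return bytes(p ^ k for p, k in zip(plaintext, stream))

c1 = bad_encrypt(b"reading=500")
c2 = bad_encrypt(b"reading=731")

# An eavesdropper XORs the two ciphertexts: the keystream cancels out,
# leaving the XOR of the two plaintexts. No key required.
leaked = bytes(a ^ b for a, b in zip(c1, c2))
assert leaked == bytes(a ^ b for a, b in zip(b"reading=500", b"reading=731"))

# Everywhere the messages agree leaks as zero bytes: the shared prefix
# "reading=" is immediately visible in the traffic.
assert leaked[:8] == b"\x00" * 8
```

And note that nothing here stops an attacker from flipping bits in transit either: this construction has neither the confidentiality nor the integrity properties a real secure channel provides.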

There are a ton of mistakes that can be made along the way. You have to think about the rotation of keys; there's a bunch of work around ratcheting keys, getting forward secrecy properties, and getting properties that prevent impersonation and replay attacks.
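The ratcheting idea mentioned above can be sketched as a one-way hash chain: each message uses a fresh key derived from the previous chain key, and old keys are deleted. Because the derivation is one-way, compromising the current key does not reveal past keys or past messages. This is a minimal sketch of the forward-secrecy property only; real protocols (Ockam secure channels, Signal's Double Ratchet) add Diffie-Hellman ratchets, authentication, and out-of-order handling on top.

```python
import hashlib
import hmac
import secrets

class Ratchet:
    """Symmetric hash ratchet: each call derives a fresh per-message key
    and advances the chain one-way, forgetting the old chain key."""

    def __init__(self, root_key: bytes):
        self._chain_key = root_key

    def next_message_key(self) -> bytes:
        # Derive the key for this message from the current chain key...
        msg_key = hmac.new(self._chain_key, b"message", hashlib.sha256).digest()
        # ...then ratchet forward. The old chain key is overwritten, so a
        # later compromise can't be walked backwards to earlier keys.
        self._chain_key = hmac.new(self._chain_key, b"ratchet", hashlib.sha256).digest()
        return msg_key

root = secrets.token_bytes(32)
sender, receiver = Ratchet(root), Ratchet(root)

# Both sides derive the same sequence of per-message keys from the root...
k1, k2 = sender.next_message_key(), sender.next_message_key()
assert receiver.next_message_key() == k1
assert receiver.next_message_key() == k2

# ...and every key is distinct, so a key stolen today is useless against
# traffic that was protected by earlier keys.
assert k1 != k2
```

Getting details like this right, in the correct order, with no reuse, is exactly why the formal proofs Mrinal mentions matter.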

There’s a host of problems that come with managing keys. A single key everywhere is not the answer, but it's easy to convince yourself you did enough and you did it right. There are so many CVEs in the history of secure channel designs that are all about people thinking they did encryption correctly but made a mistake. So that's why a lot of these properties are now proven using formal models. For example, in Ockam’s secure channel design, we have formal proofs of various properties in our design. So there's a lot that goes into doing key management well.

Matthew Gregory: Yeah, that's the first question that gets asked once encryption comes up. You said the data is encrypted; now we have to have a whole conversation about what that actually means. You may have done some things well, but that doesn't mean you have security, privacy, integrity, and authenticity. All these things go together. Using proven techniques is very important to building an architecture that is secure by design and has a low vulnerability surface. It's pretty difficult to do ad hoc. With that, I'll wrap up this podcast. That was a little insight into some of the things that the three of us talk about; hopefully that was helpful. More to come, and we'll see you later.
