Jekyll2022-12-19T15:57:22+00:00http://andrewmccall.com/feed.xmlandrewmccall.comHow Bad Data Happens2019-01-26T13:17:00+00:002019-01-26T13:17:00+00:00http://andrewmccall.com/data-quality-how-bad-data-happens<p>A lot of what we do as data engineers is fix things that have broken. Either we’ve been alerted that <em>that</em> job has failed again, or some user is telling us that some value on a record is wrong.</p>
<h1 id="how-did-we-get-here">How did we get here?</h1>
<p>When you’re writing a web service that takes a value, you can force the user to give you one or reject their request as invalid. Data is different: imagine you’ve got a warehouse fed by a few micro-services, with some reporting downstream.</p>
<p><img src="/images/post_images/how-bad-data-happens-ms.png" alt="Some Hypothetical System" title="Example system" /></p>
<p>We have a user service that holds some details about a user and a widget service that maybe monitors some IoT devices and sends an alert to the user’s mobile number if a widget stops working.</p>
<p>If the data from your user service comes in without a mobile number, what do you do? It doesn’t really matter how the required field came to be null upstream; there are plenty of reasons it can and <em>will</em> happen. For the sake of this scenario, let’s assume it was because a developer introduced a <em>“bug.”</em></p>
<p><em>I put bug in quotes because it’s equally likely that what the developer introduced wasn’t a bug at all. The field may have been required before but, due to other system changes, isn’t needed for some use cases anymore.</em></p>
<p>Now you have data flowing from one of the micro-services which is invalid. How do you handle that?</p>
<p>Well, you’ve got a few options, none ideal:</p>
<ul>
<li>Accept the broken data.</li>
<li>Drop the record and continue.</li>
<li>Fix the data in flight.</li>
</ul>
<p>I’m going to argue, for all sorts of reasons (but not here), that your least-worst option is to maintain consistency with the source, at least for ingestion purposes. So, in my opinion, the last two are out.</p>
<p>So what now? You’ve got a record that you’ve accepted as broken… before we get there… Do you even know?</p>
<h1 id="how-do-you-measure-data-quality">How do you measure data quality?</h1>
<p>Many organisations have some scripts, run by an engineering team, that let them monitor jobs for errors. Some may even have more advanced reconciliation scripts they can run as part of the process to check the data. This is mostly for the consumption of the engineering teams monitoring the jobs, and allows them to be notified that something has gone wrong.</p>
<p>These are generally good at spotting data that may be missing, but usually not so good at applying business rules around the context, and more often than not consumers of the data are completely in the dark.</p>
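<p>As a sketch of what such a check might look like (the field names, sample records and threshold here are hypothetical), even a few lines of Python can turn “missing data” into something you can alert on:</p>

```python
# Minimal completeness check: flag a batch when a required field's
# null rate crosses a threshold. Field names, the sample records and
# the threshold are hypothetical.

def null_rate(records, field):
    """Fraction of records where `field` is missing or None."""
    if not records:
        return 0.0
    missing = sum(1 for r in records if r.get(field) is None)
    return missing / len(records)

def check_batch(records, field, threshold=0.01):
    """Return (ok, rate); a caller can alert whenever ok is False."""
    rate = null_rate(records, field)
    return rate <= threshold, rate

users = [
    {"id": 1, "mobile": "+447700900000"},
    {"id": 2, "mobile": None},  # the upstream "bug": a required field gone null
]
ok, rate = check_batch(users, "mobile")
print(ok, rate)  # -> False 0.5
```

<p>The interesting part isn’t the check itself, it’s who sees the result: if this only feeds an engineering dashboard, the consumers of the data are still in the dark.</p>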
<p>So most organisations, let’s be honest, only notice data quality issues when someone else notices and complains. It could be an analyst or even an end user. This is a pretty broken and pretty sorry state of affairs.</p>
<h1 id="tooling-is-there-light-at-the-end-of-the-tunnel">Tooling, is there light at the end of the tunnel?</h1>
<p>Some organisations are starting to adopt DQ tools which do some form of data profiling. My problem with these tools is that they treat data quality as a governance function, the same way we treated testing and QA decades ago: something to centralise across a whole organisation so rules can be applied consistently to all data. The idea being that if all data follows the same rules, you can enforce them across the board and even fix the data in flight.</p>
<p>I don’t just think this is the wrong way to do things; I think it’s a fantasy that it’s even possible.</p>
<p><img src="/images/post_images/how-bad-data-happens-ms.png" alt="Some Hypothetical System" title="Example system" /></p>
<p>Let’s go back to our really simple example: I need a mobile number for a user so I can send them a text message when a widget has a problem. One potential <em>“bug”</em> a developer could introduce would be to add groups and types of users. Now I can send an alert about a widget to multiple users, and I can create a new type of user that is only concerned with billing. We really don’t need a mobile number for users concerned with just billing, so the developer makes the field optional. All her tests pass and the code works; being ever diligent and thorough, she’s even added checks to prevent setting up alerts to users without mobile numbers.</p>
<p>How is this a problem? Keeping with the example, maybe there is a KPI and report around alerts per user or alerts per widget. Maybe it’s on a monthly dashboard or in an exec pack, there is no context around it, and our backend and frontend teams have no idea it even exists. Let’s say we’re launching the new user features with a subset of users. Our team of decision makers will see one of two things:</p>
<ul>
<li>The number of alerts per widget in the system has increased! This feature is on fire, let’s launch it!</li>
<li>The number of alerts per user is lower, our users must hate this feature, let’s can it.</li>
</ul>
<p>Depending on your business model (charging per seat or per alert), one or both of these results aren’t going to give you an accurate view of the world, which will lead to bad decisions.</p>
<h2 id="would-a-dq-tool-help">Would a DQ tool help?</h2>
<p>It depends. If the quality rules enforced the fact that a user had to have a mobile number, the feature may have failed testing. But there are some assumptions and problems here.</p>
<p>First, the DQ tools and rules would have to exist and be kept in sync between production and staging environments. This would add development complexity and slow the development teams down when they encountered something like this, likely late in a testing cycle.</p>
<p>They’d have to reach out to a central team and figure out how to get the rules changed. The DQ team would need to know where a rule was used and why, and would likely need to be part of every new feature. They’d also need whoever owned the report to look at making changes to it.</p>
<p>More hand-offs, more meetings and someone would be too busy.</p>
<p>There’d be no way for the tool to “clean the data” in flight; the mobile number simply doesn’t exist. All it could do would be to reject the data, forcing users to give a mobile phone number that would never be used.</p>
<h1 id="a-better-way">A better way?</h1>
<p>Data quality should be something we treat like the metrics we gather for other services. You track memory usage, latency and requests per second for micro-services, and when things start to go wrong someone gets a notification. It’s not a central function, but part of the job for engineers or SREs.</p>
<p>In a data world this should be part of the job for engineers, analysts and data scientists. What we really need is tools that, rather than trying to enforce business rules, give us better insight into the current state of data.</p>
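<p>As a sketch of the idea (the class, window size and tolerance are illustrative, not a real tool), a data-quality measure can be watched the same way an SRE watches a latency metric: record it per batch and flag deviation from a rolling baseline.</p>

```python
# Treat a data-quality measure (e.g. a null rate) like any other
# service metric: observe it per batch and flag when it deviates
# from a rolling baseline. Window and tolerance are illustrative.
from collections import deque

class MetricMonitor:
    def __init__(self, window=24, tolerance=0.2):
        self.history = deque(maxlen=window)  # e.g. the last 24 hourly batches
        self.tolerance = tolerance           # allowed relative deviation

    def observe(self, value):
        """Record a new value; return True if it deviates from the baseline."""
        alert = False
        if self.history:
            baseline = sum(self.history) / len(self.history)
            if baseline and abs(value - baseline) / baseline > self.tolerance:
                alert = True
        self.history.append(value)
        return alert

monitor = MetricMonitor()
for rate in [0.01, 0.012, 0.011]:
    monitor.observe(rate)    # normal null rates, no alert
print(monitor.observe(0.5))  # a sudden jump -> True
```

<p>The point is less the arithmetic than the routing: the signal goes to whoever owns the data, not only to a central team.</p>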
<p>Just like we’ve moved data democratization tools towards empowering consumers to discover and remix data, analytics and insights, we should do the same with data quality. It’s not just for engineers or governance; it should be part of everything we do.</p>
<p>Call it DataOps or something else; the tooling and mindset need to shift the way software development in general has: move fast, but know when you break things.</p>Andrew McCall
Hive, managed vs external tables2017-10-27T18:00:00+00:002017-10-27T18:00:00+00:00http://andrewmccall.com/hive-managed-tables<p>One of the things that comes up often in conversations about Hive is using
managed vs. external tables.</p>
<h1 id="what-are-managed-tables">What are managed tables?</h1>
<p>Managed tables are Hive tables where Hive manages the data; Hive stores the
data internally in its own warehouse directory and generally you wouldn’t
interact with the data directly.</p>
<p>One of the key things to know about managed tables is that if you drop the table
you’re dropping the metadata <em>AND</em> the data.</p>
<h1 id="what-are-external-tables">What are external tables?</h1>
<p>External tables are Hive tables where the data is managed externally to Hive. An
example would be a folder full of files with a schema applied on top. You can
add files to or remove files from the folder, and the changes will be reflected in the Hive table.</p>
<p>When you drop an external table, you’re only dropping the metadata; the underlying
data will still exist in HDFS.</p>
<p>External table files are also accessible via HDFS and security needs to be managed at the HDFS file/folder level.</p>
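<p>The difference shows up directly in the DDL; a minimal sketch (the table names, columns and HDFS path are illustrative):</p>

```sql
-- Managed table: Hive owns the data. DROP TABLE removes
-- both the metadata and the files in Hive's warehouse directory.
CREATE TABLE events_managed (id BIGINT, payload STRING);

-- External table: Hive owns only the metadata. DROP TABLE leaves
-- the files at LOCATION untouched in HDFS.
CREATE EXTERNAL TABLE events_raw (id BIGINT, payload STRING)
LOCATION '/data/landing/events';
```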
<h1 id="when-to-use-them">When to use them?</h1>
<p>Beyond the general answer of “It depends on your use case”, some pointers that
I’d give are:</p>
<h2 id="external-tables">External Tables</h2>
<ul>
<li>The data is used or created outside of Hive. Examples include landing raw files or needing the data to process elsewhere.</li>
<li>You don’t want a DROP TABLE to delete the data. Examples include creating a number
of schemas using the same underlying dataset or creating a partial schema on some data which may be evolving.</li>
</ul>
<h2 id="managed-tables">Managed Tables</h2>
<ul>
<li>The data is temporary.</li>
<li>The data in the table is completely derived from other Hive tables.</li>
</ul>Andrew McCallAnnouncing fn, a serverless framework2017-02-05T12:00:00+00:002017-02-05T12:00:00+00:00http://andrewmccall.com/announcing-fn<p>For the last couple of months I’ve been working on
<a href="http://github.com/andrewmccall/fn">fn</a> (pronounced fun), a serverless
framework. It’s been slow going, but hopefully I’ll get some more time to spend
on it over the coming months.</p>
<p>Serverless frameworks like AWS Lambda look really good; the problem is they’re
tightly tied to the cloud environment they’re running in and link
primarily to that provider’s services.</p>
<p>fn builds an abstraction layer on top of the underlying resource scheduler, the
first iteration will be a YARN based scheduler but I intend to follow it up
with Mesos in the near future.</p>
<p>Over the coming weeks I’ll try to document some of the thinking around the
design and document some of the development.</p>
<p>fn is Apache 2 licensed and is available to fork on GitHub at <a href="http://github.com/andrewmccall/fn">http://github.com/andrewmccall/fn</a>.</p>Installing Java 8 on OS X 10.10 Yosimite2014-06-12T20:36:55+00:002014-06-12T20:36:55+00:00http://andrewmccall.com/installing-java-8-on-os-x-10-dot-10-yosimite<p>So I’ve been running OS X 10.10 for a while and tonight decided I’d try to install the Java 8 JDK (JDK8u05) and have a play for a project I’ve been messing around with. Unfortunately I got this:</p>
<p><img src="/images/post_images/java8u05-osx-10.10-install-fail.png" alt="Fail" /></p>
<p>A quick google search and the best I could come up with was to install the latest beta of OpenJDK, which apparently fixes the problem. I didn’t really want a beta version of the JDK so I had a dig around in the pkg to see if I could force the installations.</p>
<p>The checks for versions are just in a text file within the package, so it’s not hard to remove the check and force the installation.</p>
<p>First expand the pkg into a directory so you can get at the contents:</p>
<p><code class="language-plaintext highlighter-rouge">$ pkgutil --expand JDK\ 8\ Update\ 05.pkg JDK8</code></p>
<p>This creates a new folder called JDK8, in there you’ll find a file called <code class="language-plaintext highlighter-rouge">Distribution</code>. Open it in your favourite text editor. At about line 36 you’ll find the following code:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Check is current MAC OS X version less than supportedVersion
function checkForMacOSX(supportedVersion) {
    try {
        // Get current ProductVersion
        var tProductVersion = system.version.ProductVersion;
        // Get current ProductName
        var tProductName = system.version.ProductName;

        // Check if current version is less than supportedVersion, if yes set the result type to Fatal, and give correct message to user
        if (tProductVersion &lt; supportedVersion)
        {
            // Set result values
            var osCheckTitle = system.localizedStringWithFormat('OSCHECK_TITLE');
            osCheckTitle = osCheckTitle.replace("%1$@", tProductName);
            osCheckTitle = osCheckTitle.replace("%2$@", supportedVersion);
            var osCheckMessage = system.localizedStringWithFormat('OSCHECK_MESSAGE');
            osCheckMessage = osCheckMessage.replace("%1$@", tProductName);
            osCheckMessage = osCheckMessage.replace("%2$@", tProductVersion);
            osCheckMessage = osCheckMessage.replace("%3$@", supportedVersion);
            my.result.title = osCheckTitle;
            my.result.message = osCheckMessage;
            my.result.type = 'Fatal';
        }
    } catch (e) {
        // an exception just occurred
        return (false);
    }
    // return true
    return (true);
}
</code></pre></div></div>
<p>The error happens on line 45 above, the version check is comparing strings and failing because 10.10 is less than 10.7 lexicographically. If I was fixing the pkg I would rewrite the function to properly check, but I’m not and I just want the JDK to install. So I took everything out except the final <code class="language-plaintext highlighter-rouge">return(true);</code> so that the whole method looks like this:</p>
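<p>The lexicographic comparison is easy to reproduce outside the installer; for instance, in Python (used purely for illustration), comparing the strings goes character by character, while comparing numeric version components gets it right:</p>

```python
# String comparison is character by character: at the fourth character,
# "1" sorts before "7", so "10.10" compares as less than "10.7".
print("10.10" < "10.7")    # -> True (the installer's bug)

# Comparing numeric components instead gives the intended result.
print((10, 10) < (10, 7))  # -> False
```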
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Check is current MAC OS X version less than supportedVersion
function checkForMacOSX(supportedVersion) {
    // return true
    return (true);
}
</code></pre></div></div>
<p>Finally, package it back up with the <code class="language-plaintext highlighter-rouge">pkgutil</code> command and the check won’t fail and Java 8 will install.</p>
<p><code class="language-plaintext highlighter-rouge">$ pkgutil --flatten JDK8 JDK8.pkg</code></p>
<p>Double click the new pkg and the installer will run without the error message.</p>
<p><img src="/images/post_images/java8u05-osx-10.10-install-success.png" alt="Success" /></p>
<p>All done, Java 8 JDK installed!</p>Validation 1.0.0 released.2010-12-23T00:00:00+00:002010-12-23T00:00:00+00:00http://andrewmccall.com/validation-1-0-0-released-I just pushed my little JSR-303 validation lib out as a 1.0.0 release. JSR-303 is now 1.0, so it seemed like a good idea. <p />The biggest changes are probably the ones needed to support that. If you were using the library you won't notice any difference; the changes are all internal and the actual annotations haven't changed. <p />I've also removed the springframework package. JSR-303 is now supported in Spring 3 and their support is better.<p />Finally I've added some static methods to AbstractAnnotationTest, assertValid and assertViolation; I've found them useful testing annotated classes and figure someone else may too. <p />Check it out and let me know what you think. <p /><a href="http://github.com/andrewmccall/validation">http://github.com/andrewmccall/validation</a>
oEmbed2010-12-16T00:00:00+00:002010-12-16T00:00:00+00:00http://andrewmccall.com/oembedI came across <a href="http://www.oembed.com/">oEmbed</a> the other day, completely by accident. <p />I've been working on a couple of projects in my free time and in a few of them I really wanted to be able to embed better information about a URL, if it existed. I'd been working up my own rough JSON schema that was in many ways similar to <a href="http://www.oembed.com/">oEmbed</a>.<p /><div>Then I came across a tweet that mentioned <a href="http://embed.ly">embed.ly</a>, and <a href="http://www.oembed.com/">oEmbed</a> together with <a href="http://embed.ly">embed.ly</a> are the perfect solution. Instead of rolling my own service I'm integrating theirs.<br /></div><p /><div>At its most basic it's a standard way of getting more than just a link you can embed in your site, and it provides a standard interface for doing it across the web. </div><p /><div>If you're looking to embed things in your website, take a look. There are plugins available for WordPress and others, as well as a jQuery plugin you can use pretty much anywhere. </div>
Even more secure passwords.2010-12-15T00:00:00+00:002010-12-15T00:00:00+00:00http://andrewmccall.com/even-more-secure-passwords-<p>A few days ago I posted suggesting that you salt your passwords; I'm back armed with even more knowledge and better advice. Turns out the relative strengths of one hashing algorithm vs another can in fact make a difference, in a way I didn't even consider - their speed. </p>
<p />
<div>Most crypto hash functions are designed for speed; you want to be able to compute the hashes of lots of data pretty quickly if you're pushing it down the wire. That speed works in an attacker's favour if they're brute forcing a list of passwords, and newer hashing functions can make it worse: one of the requirements for SHA-3 is that it's faster than the SHA-2 family. </div>
<p />
<div>So what's the new right answer? </div>
<p />
<div>Choose a function that takes enough time that an attacker has to work for each and every password - ideally long enough that it would take forever to crack just one - while making sure that legitimate users aren't waiting forever while you check their passwords. </div>
<p />
<div>There are two ways of doing this, run a fast hash function many times or deliberately pick a slow hash function.</div>
<p />
<div>Running many iterations of a fast hashing algorithm is pretty self explanatory, run it twice and it takes twice as long, run it a thousand times and it takes a thousand times as long to attack each password.</div>
<p />
<div>Bcrypt is an example of the second, based on the blowfish algorithm it uses the fact that the key setup step is a relatively expensive operation and difficult to optimise. By making use of this bcrypt allows you to set a work factor and creates a hashing algorithm that is expensive and also difficult to optimise. </div>
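<div>As an illustration of the iterated approach (Python used purely for illustration; the iteration count here is illustrative, not a recommendation), the standard library exposes it as PBKDF2, which runs HMAC many times per password:</div>

```python
import hashlib
import os

def hash_password(password, salt=None, iterations=200_000):
    """Iterated hashing: PBKDF2 runs HMAC-SHA256 `iterations` times,
    making each guess proportionally more expensive for an attacker."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, digest

def verify(password, salt, digest, iterations=200_000):
    """Re-derive with the stored salt and compare against the stored digest."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations) == digest

salt, digest = hash_password("correct horse battery staple")
print(verify("correct horse battery staple", salt, digest))  # True
print(verify("wrong guess", salt, digest))                   # False
```

<div>Tuning the iteration count is exactly the "work factor" trade-off described above: high enough to hurt an attacker, low enough that legitimate logins stay fast.</div>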
<p />
<div>Which is better? I have no idea; both are widely used and it really depends on your environment. I'd love to hear what others think though.</div>
Chrome OS.2010-12-14T00:00:00+00:002010-12-14T00:00:00+00:00http://andrewmccall.com/chrome-os-I don't think Chrome OS is going to be a success. <p /> It's perfect for Google: it shows that all you really need is a browser, and does it in a compelling way. But it's not going to be a success, and I'll cast my lot in with those that predict it'll get rolled into Android. <br />The reason I don't see it working is that Google won't use it. Google is a company of engineers, developers and hackers - what do they want with a computer they can't use to do that? <br />The Chrome OS team is designing and building a product for someone else. Taking things away is a great thing, look at Apple, but if you take so much away that it becomes a product for someone else, you're no longer eating your own dog food. That doesn't usually work out well. <p /> I may be wrong, maybe the team is great and maybe they can overcome the obstacles, but I'll put my money on the best that comes of Chrome OS being that it's evolved into another product or, as many are predicting, rolled into Android.
Salt your passwords.2010-12-13T00:00:00+00:002010-12-13T00:00:00+00:00http://andrewmccall.com/salt-your-passwords-<p>Gawker media is the latest in a long list of compromised systems that have <a href="http://thenextweb.com/media/2010/12/13/gawker-hackers-release-file-with-ftp-author-reader-usernamespasswords/">exposed user passwords.</a> Unlike when it happened to the <a href="http://andrewmccall.com/2010/04/apache-falls-victim-to-jira-xss-exploit">ASF a few months ago,</a> I'm unaffected.</p>
<p />
<div><a href="http://blogs.forbes.com/firewall/2010/12/13/the-lessons-of-gawkers-security-mess/">Forbes</a> and others are banging on about weak encryption being the problem; it's not. Passwords generally aren't encrypted, they're passed through a one-way <a href="http://en.wikipedia.org/wiki/Hash_function">hash function</a>. You can't undo the hash, so you can't decrypt the passwords. When you hash the same value, though, it will always produce the same hash - so you can ask a user for their password, hash the value they enter and check that against the hash you've stored.</div>
<p />
<div>The relative strengths of one hash function vs another actually makes very little difference when it comes to passwords. As long as it's collision free for the set of possible passwords, which almost all will be, they're really strong enough no matter how old they are. </div>
<p />
<div>
<div>Gawker made a basic mistake that even the most advanced algorithm wouldn't fix: they weren't salting their passwords.
<p />
</div>
</div>
<div><br />
<div>
<div>Cracking hashed passwords means computing hashes until you produce the same hashed value. You run the algorithm across a list of known common passwords, dictionary words and common variations. The same value will always produce the same hash, so everyone who uses the same password will also have the same hash. You just need to compute all the common/obvious ones once and look for the users whose hashes match your list. Lots of those users will probably be using the same password for email and other services too... oops.</div>
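<div>The attack above amounts to building a lookup table once and matching every leaked hash against it in one pass. A minimal sketch, with hypothetical users and passwords:</div>

```python
import hashlib

def unsalted_hash(password: str) -> str:
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

# Hypothetical leaked table of (user, password hash) pairs.
leaked = {
    "alice": unsalted_hash("letmein"),
    "bob": unsalted_hash("letmein"),        # same password as alice -> same hash
    "carol": unsalted_hash("tr0ub4dor&3"),
}

# The attacker hashes a list of common passwords once...
common_passwords = ["123456", "password", "letmein", "qwerty"]
lookup = {unsalted_hash(p): p for p in common_passwords}

# ...then recovers every user whose hash appears in the table.
cracked = {user: lookup[h] for user, h in leaked.items() if h in lookup}
print(cracked)  # alice and bob fall together; carol's password isn't in the list
```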
<p />
<div>Salting adds something unique to a user, say their email address or ID, forcing an attacker to compute every possible password for each user individually. Even if two users have chosen the same password they will have a different hash. The better the salt you can choose, the more work an attacker has to do to get passwords.</div>
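<div>Continuing the sketch, a per-user salt (here the user's email address, as suggested above) makes identical passwords hash differently, so one precomputed table no longer cracks everyone. Illustrative only: current best practice is a random salt stored alongside the hash, fed to a slow function like bcrypt or scrypt rather than plain SHA-256:</div>

```python
import hashlib

def salted_hash(salt: str, password: str) -> str:
    # Mix something unique to the user into the input before hashing.
    return hashlib.sha256((salt + ":" + password).encode("utf-8")).hexdigest()

# Alice and Bob chose the same password...
alice = salted_hash("alice@example.com", "letmein")
bob = salted_hash("bob@example.com", "letmein")

# ...but their stored hashes differ, so the attacker must re-run the
# whole dictionary separately for each user's salt.
print(alice == bob)  # False
```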
<p />
<div>It's not a panacea though: you've still exposed their details and, given enough time, a determined attacker can and will recover every last password. What it buys you is time: time to disclose the breach, and time for your users to change their passwords on other services where they may be the same.</div>
</div>
<p />
<div><strong>UPDATE: <em>I've added a new post with some more thoughts, clarifications and corrections <a href="http://andrewmccall.com/even-more-secure-passwords">here</a></em></strong></div>
</div>Git checkout a tag.2010-11-13T00:00:00+00:002010-11-13T00:00:00+00:00http://andrewmccall.com/git-checkout-a-tag-When I recently rebuilt my development server using <a href="http://www.puppetlabs.com/">Puppet</a> I decided not to back up my Nexus repository, since there were all of three files I'd built in there. Today I wanted to start recreating them and couldn't find out how to check out a git tag. <p /><div>A couple of Google searches weren't terribly successful, and it's something that's not really explained well, so here it is.</div><p /><div>To check out a tag you just have to</div><div>[code]git checkout <TAG_NAME>[/code]</div><p /><div>Yes, it's that simple.</div><p /><div>You'll probably get a warning message telling you you're in a detached HEAD state. If you intend to do some work here, it's a really good idea to take the advice the message gives you and create a branch. You can do that when you check out by adding -b:</div><p /><div>[code]git checkout <TAG_NAME> -b <BRANCH_NAME>[/code]</div><p /><div>I didn't want to do anything except a [code]mvn clean deploy[/code] to get my library deployed back to Nexus, then get back to where I was. So I didn't.</div>