6 digit year analysis



As an analysis it’s my job to analyze data. Quite often I run across a common problem with dates exported from a mainframe job. Often the date is 6 characters long without any delimiter. Why oh why didn’t Y2K resolve this problem. Y2K??? you ask inquisitively?

For those not knowing, Y2K was a problem back in the late 1990s where computer programmers had to correct computer code. The problem was simple. When computers were new, memory was expensive so coders only represented the year with only 2 digits. Who would have thought that the code written in the late 70’s and early 80’s would still be in use?

As a result when the year 2000 rolled around, all the old code would show 00 for the year. We were told dire consequences would ensue; planes would drop from the sky, bank accounts would empty, worldwide panic. Fortunately none of that happened. But happily for thousands of programmers, secure jobs were available to verify millions of lines of code. The fix was simple; change the year to be 4 digits.

However there was also another fix, recode the handling of two digits so that the computer would know that 00 did not represent 1900, but rather 2000. And that leads me to today’s problem.
Whenever I see dates represented with only 6 digits I start by breaking the data down into columns for Month Day and Year. However I don’t know which column is the Month, Day or Year.

So I start by looking at the first column and see if there are numbers greater the 12. If there are, then the column can’t be the Month since there are only 12 months in a year.  In this example one cell has a 13 in it so this column cannot represent a month.

Continuing with my example the third column also has numbers greater than 12 so that leaves the second column as my month.


Examples
03
09
17
12
06
11
10
05
14
07
07
13
09
03
13
03
11
06
03
06
27
10
06
09
05
06
01
05
08
28
04
05
23
01
08
30
08
07
15
05
06
28
11
12
31
13
06
07
07
01
03
08
05
01
04
04
30
11
08
09


So now the mystery is which is the day column and which is the year column. Here is where you need to know your data. With my work I mostly deal with data from the years 1995 through 2014. So 95 – 14. Column 1 has numbers 1 – 13. However column 3 has numbers that include 31, 30, 27 and 28. Hmmmm. It looks like a day column. And it’s easy enough to verify. If the second column is the month (and we have already determined that it is), then I can verify the month in the row with the 31 in the third column. If that month has 31 days, then I then know the day is the third column of data. In my above example I have a 12 in the second column (December). 

11
12
31


So the second column is the Month and the 31 confirms that the third column is the day. That leaves the first column as the year.

So could I code this solution in Excel? 

I start by identifying the max value in each column.



A
B
C
D
E
F
G
1
3
9
17

Max Col 1
Max Col 2
Max Col 3
2
12
6
11

13
12
31
3
10
5
14

=MAX(A:A)
=MAX(B:B)
=MAX(C:C)
4
7
7
13




5
9
3
13






Well right off I know column 2 has to be month since it’s max value is 12 and column 1 and 3 max value is greater than 12.

I also see column 3 has a max value of 31. So that looks like the day.

So very quickly I can identify which columns are which.

Year  Month  Day

No comments:

Post a Comment