As an analysis it’s my job to analyze data. Quite often I
run across a common problem with dates exported from a mainframe job. Often the
date is 6 characters long without any delimiter. Why oh why didn’t Y2K resolve
this problem. Y2K??? you ask inquisitively?
For those not knowing, Y2K was a problem back in the late
1990s where computer programmers had to correct computer code. The problem was simple.
When computers were new, memory was expensive so coders only represented the
year with only 2 digits. Who would have thought that the code written in the
late 70’s and early 80’s would still be in use?
As a result when the year 2000 rolled around, all the old
code would show 00 for the year. We were told dire consequences would ensue;
planes would drop from the sky, bank accounts would empty, worldwide panic.
Fortunately none of that happened. But happily for thousands of programmers,
secure jobs were available to verify millions of lines of code. The fix was simple;
change the year to be 4 digits.
However there was also another fix, recode the handling of
two digits so that the computer would know that 00 did not represent 1900, but
rather 2000. And that leads me to today’s problem.
Whenever I see dates represented with only 6 digits I start
by breaking the data down into columns for Month Day and Year. However I don’t know
which column is the Month, Day or Year.
So I start by looking at the first column and see if there
are numbers greater the 12. If there are, then the column can’t be the Month
since there are only 12 months in a year. In this example one cell has a 13 in it so
this column cannot represent a month.
Continuing with my example the third column also has numbers
greater than 12 so that leaves the second column as my month.
Examples
03
|
09
|
17
|
12
|
06
|
11
|
10
|
05
|
14
|
07
|
07
|
13
|
09
|
03
|
13
|
03
|
11
|
06
|
03
|
06
|
27
|
10
|
06
|
09
|
05
|
06
|
01
|
05
|
08
|
28
|
04
|
05
|
23
|
01
|
08
|
30
|
08
|
07
|
15
|
05
|
06
|
28
|
11
|
12
|
31
|
13
|
06
|
07
|
07
|
01
|
03
|
08
|
05
|
01
|
04
|
04
|
30
|
11
|
08
|
09
|
So now the mystery is which is the day column and which is
the year column. Here is where you need to know your data. With my work I
mostly deal with data from the years 1995 through 2014. So 95 – 14. Column 1
has numbers 1 – 13. However column 3 has numbers that include 31, 30, 27 and
28. Hmmmm. It looks like a day column. And it’s easy enough to verify. If the
second column is the month (and we have already determined that it is), then I
can verify the month in the row with the 31 in the third column. If that month
has 31 days, then I then know the day is the third column of data. In my above
example I have a 12 in the second column (December).
11
|
12
|
31
|
So the second column is the Month and the 31 confirms that
the third column is the day. That leaves the first column as the year.
So could I code this solution in Excel?
I start by identifying the max value in each column.
|
A
|
B
|
C
|
D
|
E
|
F
|
G
|
1
|
3
|
9
|
17
|
Max Col 1
|
Max Col 2
|
Max Col 3
|
|
2
|
12
|
6
|
11
|
13
|
12
|
31
|
|
3
|
10
|
5
|
14
|
=MAX(A:A)
|
=MAX(B:B)
|
=MAX(C:C)
|
|
4
|
7
|
7
|
13
|
||||
5
|
9
|
3
|
13
|
Well right off I know column 2 has to be month since it’s
max value is 12 and column 1 and 3 max value is greater than 12.
I also see column 3 has a max value of 31. So that looks
like the day.
So very quickly I can identify which columns are which.
Year Month Day