Shortest Formula Challenge #3: Results and Discussion 1

Last week I set readers the challenge which can be found here.

This one provoked quite a bit of debate, and not all of it Excel-related! As I already have to several readers, I must again apologize for the lack of realism and statistical know-how inherent in the premise for this challenge, which was evidently constructed more with the required formula-work in mind than with any serious thought to methods in demography.

Still, at least some fascinating and impressive Excel work came out of it all, so perhaps my poor groundwork is somewhat forgiven, at least retrospectively!

It’s also perhaps the first in this series of Shortest Formula Challenges in which I cannot say with absolute confidence that we have achieved the shortest possible solution. However, we’ve certainly done a collectively good job, I feel, and thanks to the contributions of so many I imagine that we can’t be too far off the optimal construction.

And this week’s challenge is also unique in that I will not be awarding the prize to any single person, but rather to two, both of whom made significant contributions along the way to achieving that solution: John Jairo V and Lori.

Having said that, the bulk of the credit has to go to John, who gave us the initial construction – an impressive fusion of MMULT, SUMIF and OFFSET – which Lori then managed to reduce in length somewhat using a truly awe-inspiring technique – involving a reciprocation of the FREQUENCY function – which I for one had not previously come across and which is certainly one to remember.

So congratulations to John and Lori!

Of course, although his set-up was not particularly short, congratulations are also due to Chris BM, who also provided a correct solution. And we mustn’t forget that, yet again (!), Isai Alvarado topped everyone with his Google Sheets solution, even though the vast majority of the components in its 122 characters I imagine look completely alien to most of us Excel-heads!

Anyway, the solution we finally arrived at is:

=LOOKUP(,0/FREQUENCY(0,1/MMULT(SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2})),{1;1;1})),B3:B4)

I should first warn readers that, this being a Shortest Formula Challenge, some liberties have been taken with regards to syntax in the above construction which would otherwise not be recommendable.

Indeed, in the attempt to shave off as many characters as possible, it could perhaps be argued that some of the “techniques” employed have resulted in a formula which, whilst achieving the goal of being ever more concise, has nevertheless paid a high price in terms of its resulting lucidity.

Still, the reasons for these omissions and syntactical anomalies being perfectly acceptable are not without interest in their own right, and so I have decided to present the formula in all its glorious artifice, with the caveat that I promise to reveal the technical justifications for such underhand practices – including replacing zero parameters with nothing, leaving out others entirely and reducing a range reference seemingly inexplicably – in my deconstruction below.

How does it work?

The important point to note – and for which John must be credited – is the realisation that, in order to measure the average year-on-year increase, it is not strictly necessary that we go as far as actually calculating any form of mathematical average.

The reason being that, in any case, the number of years over which we are peforming our calculations is constant for each Region, i.e. 3.

As such, it will suffice to simply calculate, for each of those Regions, a total of all year-on-year ratios for all Countries within a given Region. And this John achieved using a combination of MMULT, OFFSET and SUMIF, so let’s look at how he achieved this.

Firstly, this part:

SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))

is, effectively, performing a series of three distinct SUMIF calculations, the three sum_ranges for which are generated using the part:

OFFSET(D3,,{0,1,2}

Now, here’s the first of those “liberties” for minimizing formula length which I described earlier. What this actually amounts to (and arguably the form we should maintain for the sake of clarity, were we not involved in such a contest) is a construction which is, in the end, identical to:

OFFSET(D3,,{0,1,2},10)

It should be pointed out that this is not the case in general. In fact, other factors aside, omission of either (or both) of the height or width parameters in OFFSET is identical to as if we had entered a value of 1 for those parameters, not 10.

As Lori explained, however, since the resulting construction is being passed to a further function (SUMIF in this case), and since that function requires that the sum_range be of a size and displacement equal to that of the range, here Excel (artificially) expands our range generated from OFFSET to meet that requirement.

Hence, whereas under normal circumstances, a construction such as:

OFFSET(D3,,{0,1,2}

appropriately coerced, would resolve to an array of single-cell range references, i.e. (note that this is a purely contrived representation: in reality, there is no reasonable means of representing such an “intermediate” array of range references; hence Excel’s habit of displaying #VALUE! at such times):

{D3,E3,F3}

in our case Excel takes each of those references and extends them (downwards, as it happens) such that they meet our criterion that they be of the same size as our range, i.e. B3:B12. In essence, then:

SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))

is actually resolved just as if it were (again, licence taken with the “resolution” of this array):

SUMIF(B3:B12,B3:B12,{D3:D12,E3:E12,F3:F12})

and NOT, as we might have supposed, as:

SUMIF(B3:B12,B3:B12,{D3,E3,F3})

Hence, we will be returning a 10-row-by-3-column array of results using this construction, the entries in each column of which will be equivalent to the results of the three individual calculations:

SUMIF(B3:B12,B3:B12,D3:D12)
SUMIF(B3:B12,B3:B12,E3:E12)
SUMIF(B3:B12,B3:B12,F3:F12)

(Though it should be pointed out that each of these formulas would, on its own, be insufficient, since there is no additional, external formula acting to coerce an array of returns.)

Hence:

SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))

will resolve to (since here we do have that external coercion, courtesy of MMULT):

{1345097171,1351670528,1358363416;8487319,8457915,8429700;22210006,22872070,23546083;6287968,6329821,6373552;1345097171,1351670528,1358363416;22210006,22872070,23546083;6287968,6329821,6373552;8487319,8457915,8429700;1345097171,1351670528,1358363416;8487319,8457915,8429700}

in which the three entries in each of the ten rows represent the sum from column D, E and F respectively for each of the Regions in column B. For example, the first row of values – {1345097171,1351670528,1358363416} – are the totals from column D, E and F corresponding to a column B entry of “East Asia & Pacific”, etc.

Now, to determine the year-on-year changes, we simply produce an identical matrix of values to the above, but this time for columns C, D and E, and then divide each of the values in the above array by the corresponding element in this new array. Hence, by a similar logic to that outlined above, it can be verified that:

SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2}))

will resolve to:

{1338663302,1345097171,1351670528;8522630,8487319,8457915;21558045,22210006,22872070;6249188,6287968,6329821;1338663302,1345097171,1351670528;21558045,22210006,22872070;6249188,6287968,6329821;8522630,8487319,8457915;1338663302,1345097171,1351670528;8522630,8487319,8457915}

And so, performing our division means that:

SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2}))

gives us:

{1.00480618912193,1.00488690121556,1.00495156760568;0.995856795378891,0.996535537311606,0.996664071464421;1.03024212074889,1.02980926704837,1.02946882376628;1.00620560623236,1.00665604532339,1.00690872617093;1.00480618912193,1.00488690121556,1.00495156760568;1.03024212074889,1.02980926704837,1.02946882376628;1.00620560623236,1.00665604532339,1.00690872617093;0.995856795378891,0.996535537311606,0.996664071464421;1.00480618912193,1.00488690121556,1.00495156760568;0.995856795378891,0.996535537311606,0.996664071464421}

in which, for example, the first value of 1.00480618912193 represents the proportional increase in population for the Region “East Asia & Pacific” between 2010 and 2011.

In order to determine which of the ten Regions had the highest year-on-year average, we simply sum the three proportions in the above array for each Region and identify the maximum of these (as stated earlier, there is actually no need to compute a mathematical average here). This means summing the three entries in each of the ten rows in the above matrix, which of course we achieve using MMULT, such that:

MMULT(SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2})),{1;1;1})

which is:

MMULT({1.00480618912193,1.00488690121556,1.00495156760568;0.995856795378891,0.996535537311606,0.996664071464421;1.03024212074889,1.02980926704837,1.02946882376628;1.00620560623236,1.00665604532339,1.00690872617093;1.00480618912193,1.00488690121556,1.00495156760568;1.03024212074889,1.02980926704837,1.02946882376628;1.00620560623236,1.00665604532339,1.00690872617093;0.995856795378891,0.996535537311606,0.996664071464421;1.00480618912193,1.00488690121556,1.00495156760568;0.995856795378891,0.996535537311606,0.996664071464421},{1;1;1})

gives us:

{3.01464465794317;2.98905640415492;3.08952021156354;3.01977037772668;3.01464465794317;3.08952021156354;3.01977037772668;2.98905640415492;3.01464465794317;2.98905640415492}

Now, here’s where the ingenious method given us by Lori comes into play. Whereas John (and myself, and the vast majority of others I imagine) employed a standard technique using INDEX, MATCH and MAX here to determine the Region which corresponds to the maximum value in the above array, viz:

=INDEX(B3:B12,MATCH(MAX(MMULT(SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2})),{1;1;1})),MMULT(SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2})),{1;1;1}),0))

which is perfectly fine, though, this being a shortest formula challenge (and even if it wasn’t, to be honest), a little wanting in the brevity department.

What Lori did was to employ a completely different technique using LOOKUP and FREQUENCY, which gives identical results to the above, is much more succinct, and which is also frankly sublime when you see it for the first time.

Let’s remind ourselves of the solution and I’ll then explain this innovative technique:

=LOOKUP(,0/FREQUENCY(0,1/MMULT(SUMIF(B3:B12,B3:B12,OFFSET(D3,,{0,1,2}))/SUMIF(B3:B12,B3:B12,OFFSET(C3,,{0,1,2})),{1;1;1})),B3:B4)

We can substitute in our values obtained from the MMULT above:

=LOOKUP(,0/FREQUENCY(0,1/{3.01464465794317;2.98905640415492;3.08952021156354;3.01977037772668;3.01464465794317;3.08952021156354;3.01977037772668;2.98905640415492;3.01464465794317;2.98905640415492}),B3:B4)

We now reciprocate this array of values with unity, giving:

=LOOKUP(,0/FREQUENCY(0,{0.331714053716128;0.334553740307462;0.323674852896956;0.331151006505605;0.331714053716128;0.323674852896956;0.331151006505605;0.334553740307462;0.331714053716128;0.334553740307462}),B3:B4)

The logical reasoning behind this prior reciprocation is simple: the largest value in the array prior to reciprocation (which of course is what we are seeking to determine) will necessarily be the smallest value in the new, post-reciprocation array.

And the reason that this is useful is that, in general, if we pass an array of values – all between 0 and 1 – as the bins_array to FREQUENCY, with 0 as the data_array, then that zero will be assigned to the smallest value from our bins_array; the remaining bins of course will be empty, or zero.

Hence, the above will resolve to:

=LOOKUP(,0/{0;0;1;0;0;0;0;0;0;0;0},B3:B4)

We now employ a standard LOOKUP construction with reciprocation, so that the above becomes:

=LOOKUP(,{#DIV/0!;#DIV/0!;0;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!;#DIV/0!},B3:B4)

Here are two further examples of the extent to which we have pushed our attempts at creating the formula with the least number of characters possible. In this standard construction, any suitable value >=0 for the lookup_value will suffice, and here we have taken advantage of the fact that omitting this parameter is equivalent to entering a value of zero for that parameter.

Not only that, but we can even “cheat” by saving one character off the result_vector. As Lori pointed out, it is not strictly necessary to use the range B3:B12 here: by using a reduced range of B3:B4 we save one character, the technical justification for which is given by virtue of the fact that, just as we saw earlier with the omission of the height parameter for OFFSET, here too Excel helpfully redimensions our range of B3:B4 to the required B3:B12 so that our LOOKUP construction is syntactically valid.

And so, finally, the above resolves to “Sub-Saharan Africa”, as desired.

Thanks again to all who contributed, especially to John and Lori. Another challenge to follow shortly. Watch this space!

One comment

  1. Oh I did not realize there was no need to compute the mathematical average. In that case, the Google Sheets formula could be shortened to

    Formula

    =index(sort(query(B3:F12,"select B,sum(D)/sum(C)+sum(E)/sum(D)+sum(F)/sum(E)group by B"),2,0),2,1)

    Length 97 characters

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s