Advanced Formula Challenge #2: Results and Discussion 3

Last week I set readers the challenge which can be found here.

Three solutions were offered, two of which from the same person, and both of which were correct! So many congratulations to Bill on successfully solving what was quite a complex challenge!

Indeed, as Ben Schwartz pointed out, this challenge appears to have been set previously on the internet, and seems to have been only partially solved on those occasions. In any case, thanks also to Ben for his suggestion, which he confesses was cobbled together from those previous solutions he found, and which worked in all but a few exceptional cases.

We had some dispute as to whether Bill’s first solution was really “to the point”, and although it worked for the given dataset, it would have been a challenge indeed (and a rather tortuous task!) to extend that set-up to work for larger datasets as well.

However, the second solution offered by Bill suffered no such drawbacks, and is fully sound. So well done again, Bill!

I would now like to give a breakdown of the solution I came up with, which is quite a bit shorter than Bill’s. I’m going to use the formula in A3 as an example (the entry in B1 – “Dale”, is perhaps not the best example to choose since that and “Adel” from E1:E10 are the only strings of length four present across both ranges, which will perhaps not serve to highlight all the issues at hand).

The solution, for that cell, is the following array formula:

=OR((LEN($E$1:$E$10)=LEN(B3))*(MMULT(0+(LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))=LEN(SUBSTITUTE(LOWER(B3),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))),ROW(INDIRECT("1:"&LEN(B3)))^0))=LEN(B3))

How does it work?

The first part:

LEN($E$1:$E$10)=LEN(B3)

is obvious enough, simply checking which of the entries in E1:E10 have the same number of characters as that in B3, and so resolves to:

{TRUE;FALSE;TRUE;TRUE;FALSE;TRUE;TRUE;TRUE;TRUE;FALSE}

Let’s look at the main clause then, and concentrate on the first array being passed to MMULT for now, i.e.:

0+(LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))=LEN(SUBSTITUTE(LOWER(B3),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),"")))

Taking this part:

LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))

we should be quite happy that:

LOWER($E$1:$E$10)

will simply convert all our entries in E1:E10 to lower case, and so avoid any potential issues that may arise with certain Excel functions considering e.g. “A” to be different from “a” (SUBSTITUTE being one example). Hence the return will look like:

{"andrew";"adel";"shelia";"anelie";"blake";"andria";"shelia";"eleina";"kassia";"daren"}

This array is being passed to the SUBSTITUTE function, for which the old_text parameter is derived from the expression:

MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1)

so let’s break this part down.

Well, LOWER(B3) is simply “eliane” and:

TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3))))

resolves to:

TRANSPOSE(ROW(INDIRECT("1:"&6)))

which is:

TRANSPOSE({1;2;3;4;5;6})

i.e. {1,2,3,4,5,6}.

Hence, substituting these back in:

MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1)

becomes:

MID("eliane",{1,2,3,4,5,6},1)

and we see therefore that the purpose of this construction was simply to create an array consisting of all the individual characters for the string in B3, since clearly this resolves to the array {“e”,”l”,”i”,”a”,”n”,”e”}.

Therefore:

LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))

becomes:

LEN(SUBSTITUTE({"andrew";"adel";"shelia";"anelie";"blake";"andria";"shelia";"eleina";"kassia";"daren"},{"e","l","i","a","n","e"},""))

The important thing to note here (and it may well be worth readers going through some of my previous posts if they are not too familiar with the concept of vector displacement) is that the first of the arrays being passed to the SUBSTITUTE function (the text parameter) is a single-column vector and the second (the old_text parameter) a single-row vector.

This was the reason for the use of TRANSPOSE, since, our two arrays being orthogonal, we know that the array resulting from the two arrays here – one a 10-row-by-1-column array, the other a 1-row-by-6-column array – will be a 10-row-by-6-column array, the entries in which will be the results of the SUBSTITUTE function applied to all 60 combinations generated by pairing each of the 10 elements from the first array with each of the 6 elements in the second. This 10-by-6 matrix will therefore look like:

{"andrw","andrew","andrew","ndrew","adrew","andrw";"adl","ade","adel","del","adel","adl";"shlia","sheia","shela","sheli","shelia","shlia";"anli","aneie","anele","nelie","aelie","anli";"blak","bake","blake","blke","blake","blak";"andria","andria","andra","ndri","adria","andria";"shlia","sheia","shela","sheli","shelia","shlia";"lina","eeina","elena","elein","eleia","lina";"kassia","kassia","kassa","kssi","kassia","kassia";"darn","daren","daren","dren","dare","darn"}

Applying LEN to these will give:

{5,6,6,5,5,5;3,3,4,3,4,3;5,5,5,5,6,5;4,5,5,5,5,4;4,4,5,4,5,4;6,6,5,4,5,6;5,5,5,5,6,5;4,5,5,5,5,4;6,6,5,4,6,6;4,5,5,4,4,4}

The entries in this array correspond to the lengths of the strings in E1:E10 after removing in turn each of the characters in the string “eliane” (i.e. “e”, “l”, “i”, “a”, “n” and “e”).

For example, the entries in the first row of this matrix – 5,6,6,5,5,5 – equate to the respective lengths resulting from removing “e”, “l”, “i”, “a”, “n” and “e” from “andrew”.

And the entries in the second row of this matrix – 3,3,4,3,4,3 – equate to the respective lengths resulting from removing “e”, “l”, “i”, “a”, “n” and “e” from “adel”.

and so on and so on.

Now, I’ll come to the reason why we’ve gone to the effort to construct this array shortly, and also the reason why we’re then passing it to MMULT, but first we need to look at the part on the other side of the equals sign within that first array, which is identical to this one apart from the fact that, whereas with the previous version we passed the entire range E1:E10 as the first argument to the SUBSTITUTE function, here we are passing just the single cell B3. This time, then:

LEN(SUBSTITUTE(LOWER(B3),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))

will resolve to, with a logic similar to that in the previous expansion:

LEN(SUBSTITUTE("eliane",{"e","l","i","a","n","e"},""))

which is:

LEN({"lian","eiane","elane","eline","eliae","lian"})

i.e. {4,5,5,5,5,4}.

And so, just as in the previous case, all we have done here is to effectively perform six substitutions on the string “eliane”, using each of the six individual characters from that same string, and returned an array of the resulting lengths after these substitutions.

But of course this method is doing nothing other than telling us precisely how many of each of those characters appears in that string. So, effectively, the array {4,5,5,5,5,4} can be interpreted as saying, quite simply, that there is/are:

2 occurrences of the character “e” in “eliane”
1 occurrence of the character “l” in “eliane”
1 occurrence of the character “i” in “eliane”
1 occurrence of the character “a” in “eliane”
1 occurrence of the character “n” in “eliane”
2 occurrences of the character “e” in “eliane”

All this is self-evident, at least to us: what we have done is merely to generate a means by which we can express those truths in Excel terms.

Now, the reason for all this is that, if, for any of the 10 1-by-6 arrays from our large 60-element matrix – and recall that each of these 1-by-6 arrays consists of the lengths of the 10 strings in E1:E10 after removing each of the 6 letters from “eliane” in turn – all 6 of the elements within that array correspond precisely with these values here ({4,5,5,5,5,4}), and occupy the same place within their respective arrays, then we can conclude that the string corresponding to that particular 1-by-6 array must be an anagram of the string in B3.

To try to clarify, let’s take the first row in that large matrix – {5,6,6,5,5,5} – which corresponds to the results of that expression applied to “andrew”.

The first element (5) equates to the fact that, after having the first character in “eliane”, i.e. “e” removed from it, the length of the resulting string – “andrw” – is 5.

But of course, expressed differently, this is just the same as saying that there is precisely one occurrence of the letter “e” in “andrew”.

Similarly, the second element in that array (6) equates to the fact that, after having the second character in “eliane”, i.e. “l” removed from it, the length of the resulting string – “andrew” – is still 6 (SUBSTITUTE rather nicely leaves a string unchanged if the old_text parameter is not found).

And, again, this is just the same as saying that there are zero occurrences of the letter “l” in “andrew”.

Readers can easily verify that the third, fourth, fifth and sixth elements in that array – 6, 5, 5 and 5 – correspond to the lengths of the strings “andrew”, “ndrew”, “adrew” and “andrw” respectively.

As such, we can now confidently state that there is/are:

1 occurrence of the character “e” in “andrew”
0 occurrences of the character “l” in “andrew”
0 occurrences of the character “i” in “andrew”
1 occurrence of the character “a” in “andrew”
1 occurrence of the character “n” in “andrew”
1 occurrence of the character “e” in “andrew”

(The fact that the last statement is a repetition of the first is not important – I merely left it in for the purpose of completing the analysis.)

And so “andrew” is not an anagram of “eliane”. Of course, we can see this straight away, but for Excel things are not so self-evident (in fact, this whole challenge is surprisingly more difficult than it would at first appear).

The next string to be considered, i.e. that from cell E2 (“adel”), is only of length four characters and so will not even get past our initial clause (which checks whether the string is of the same length as that in B3). So let’s take the next string from E1:E10 which does pass that test so that we can have another example to look at, and that string will be from cell E3, i.e. “Shelia”.

Performing a similar analysis for the third row in our large 10-by-6 matrix then, i.e. {5,5,5,5,6,5}, and recalling that this array derives from having each of the characters in “eliane” respectively removed from “shelia”, we can see that this array equates to the statements:

1 occurrence of the character “e” in “shelia”
1 occurrence of the character “l” in “shelia”
1 occurrence of the character “i” in “shelia”
1 occurrence of the character “a” in “shelia”
0 occurrences of the character “n” in “shelia”
1 occurrence of the character “e” in “shelia”

Five out of six. Close, but not close enough. Again, I have left the first and last lines in for clarity. What we would ideally want is for both those lines to state “2 occurrences of the character “e” in “shelia”” since that is precisely how many of those characters there are in “eliane”.

Let’s take one more, if nothing else because the next entry in E1:E10, “anelie”, does in fact turn out to be an anagram of “eliane”. The row-array corresponding to that name from our large matrix is:

(4,5,5,5,5,4)

and you may recognize this as being precisely the same array which we obtained when we performed those substitutions on the string “eliane” itself. To be consistent, our statements here would be:

2 occurrences of the character “e” in “anelie”
1 occurrence of the character “l” in “anelie”
1 occurrence of the character “i” in “anelie”
1 occurrence of the character “a” in “anelie”
1 occurrence of the character “n” in “anelie”
2 occurrences of the character “e” in “anelie”

and of course those are also the precise number of occurrences of each of those letters in “eliane”. Hence “anelie” must be an anagram of “eliane”.

So we have found one of our anagrams. But Excel does not yet “know” that. What we need to now do is to construct the necessary syntax to instruct Excel to “match” our array of (4,5,5,5,5,4) within the larger 10-row-by-6-column matrix, shown again below:

{5,6,6,5,5,5;3,3,4,3,4,3;5,5,5,5,6,5;4,5,5,5,5,4;4,4,5,4,5,4;6,6,5,4,5,6;5,5,5,5,6,5;4,5,5,5,5,4;6,6,5,4,6,6;4,5,5,4,4,4}

Ideally we want to somehow return TRUE, based on the fact that, of the 10 1-by-6 arrays in this matrix, there is at least one which is precisely {4,5,5,5,5,4}. (There are in fact two: the 8th row in the above corresponds to the results for the string in E8, i.e. “Eleina”, which is also an anagram of “Eliane”.)

Matching single elements within single-row or single-column arrays is elementary, and precisely the task for which the MATCH function was designed. But what about matching an array within a larger array? How on earth can we do that?

And this is where MMULT comes in handy. Let’s first see how the large construction which makes up this function’s first array is resolved once we’ve set the two arrays calculated above equal to each other, i.e.:

0+(LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))=LEN(SUBSTITUTE(LOWER(B3),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),"")))

will become:

0+({5,6,6,5,5,5;3,3,4,3,4,3;5,5,5,5,6,5;4,5,5,5,5,4;4,4,5,4,5,4;6,6,5,4,5,6;5,5,5,5,6,5;4,5,5,5,5,4;6,6,5,4,6,6;4,5,5,4,4,4}={4,5,5,5,5,4})

and, resolving this construction involving these two matrices – one a 10-row-by-6-column matrix, the other a 1-row-by-6-column matrix – will produce another 10-row-by-6-column matrix whose entries will consist of Boolean TRUE/FALSE responses to the equality query, i.e.:

0+({FALSE,FALSE,FALSE,TRUE,TRUE,FALSE;FALSE,FALSE,FALSE,FALSE,FALSE,FALSE;FALSE,TRUE,TRUE,TRUE,FALSE,FALSE;TRUE,TRUE,TRUE,TRUE,TRUE,TRUE;TRUE,FALSE,TRUE,FALSE,TRUE,TRUE;FALSE,FALSE,TRUE,FALSE,TRUE,FALSE;FALSE,TRUE,TRUE,TRUE,FALSE,FALSE;TRUE,TRUE,TRUE,TRUE,TRUE,TRUE;FALSE,FALSE,TRUE,FALSE,FALSE,FALSE;TRUE,TRUE,TRUE,FALSE,FALSE,TRUE})

which, adding the zero, gives:

{0,0,0,1,1,0;0,0,0,0,0,0;0,1,1,1,0,0;1,1,1,1,1,1;1,0,1,0,1,1;0,0,1,0,1,0;0,1,1,1,0,0;1,1,1,1,1,1;0,0,1,0,0,0;1,1,1,0,0,1}

And as you can see, the fourth and eighth rows within this matrix are the only ones to consist of a string of six ones.

Readers familiar with matrix multiplication might realise that, in order to detect whether any of the rows in this array do indeed contain six ones, we simply need to multiply this 10-row-by-6-column matrix with a 6-row-by-1-column unit matrix (i.e. {1;1;1;1;1;1}).

Recall that this 6 is not fixed, however, and is dependent on the fact that we are attempting to find anagrams for “Eliane”, a six-letter word. If we were instead looking for anagrams for “Dale”, the large matrix we would have generated would not be a 10-row-by-6-column array, as here, but rather a 10-row-by-4-column array.

Hence, to generate the single-column unit vector {1;1;1;1;1;1} (which would be {1;1;1;1} if we were dealing instead with “Dale”) dynamically, we use the construction which you can see makes up the second array being passed to MMULT, i.e.

ROW(INDIRECT("1:"&LEN(B3)))^0

which resolves to:

ROW(INDIRECT("1:"&6)))^0

which is:

{1;2;3;4;5;6}^0

i.e. {1;1;1;1;1;1}, as required.

Our MMULT construction is now:

MMULT({0,0,0,1,1,0;0,0,0,0,0,0;0,1,1,1,0,0;1,1,1,1,1,1;1,0,1,0,1,1;0,0,1,0,1,0;0,1,1,1,0,0;1,1,1,1,1,1;0,0,1,0,0,0;1,1,1,0,0,1},{1;1;1;1;1;1})

which resolves to:

{2;0;3;6;4;2;3;6;1;4}

and we can our two sixes in positions 4 and 8 which correspond to our two anagrams “Anelie” and “Eleina”.

The rest of the formula is pretty straightforward. Setting the above array equal to the length of the string in B3, i.e. 6, then multiplying the resulting array of Booleans by our first array of Booleans (which recall is simply a check on which of the strings in E1:E10 has a length equal to that in B3), so that, overall:

=OR((LEN($E$1:$E$10)=LEN(B3))*(MMULT(0+(LEN(SUBSTITUTE(LOWER($E$1:$E$10),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))=LEN(SUBSTITUTE(LOWER(B3),MID(LOWER(B3),TRANSPOSE(ROW(INDIRECT("1:"&LEN(B3)))),1),""))),ROW(INDIRECT("1:"&LEN(B3)))^0))=LEN(B3))

becomes:

=OR(({6;4;6;6;5;6;6;6;6;5}=6)*({2;0;3;6;4;2;3;6;1;4})=6)

which gives:

=OR({2;0;3;6;0;2;3;6;1;0}=6)

and finally:

=OR({FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE})

which is TRUE, as required.

New challenge to follow shortly. Watch this space!

3 comments

  1. Pingback: #Excel Identifying if a String is the Anagram of Another String Using the TEXTJOIN Function by David Hager | Excel For You

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s