List of unique entries from column of space-separated strings 3

Given the list below in A1:A10, we may wish to create a list of unique, single words from that list, as per column B here.

Unique From Space Separated

We can do this with the following set-up:

First we create the following in Name Manager:

Name: Range1
Refers to:

=$A$1:$A$10

Name: Arry1
Refers to:

=1+LEN(Range1)-LEN(SUBSTITUTE(Range1," ",""))

Name: Arry2
Refers to:

=ROW(INDIRECT("1:"&(MAX(Arry1)*ROWS(Range1))))

Name: Arry3
Refers to:

=INDEX(TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),TRANSPOSE(999*(ROW(INDIRECT("1:"&MAX(Arry1)))-1)+1),999)),N(IF(1,1+INT((Arry2-1)/MAX(Arry1)))),N(IF(1,1+MOD(Arry2-1,MAX(Arry1)))))

The array formula in C1 is:

=SUM((Arry3<>"")/MMULT(0+(Arry3=TRANSPOSE(Arry3)),ROW(INDIRECT("1:"&COUNTA(Arry3)))^0))

This simply calculates the number of unique, single-word entries within Range1, and will be used in a clause in the main formulas so that we return blanks when we reach a row beyond that number.

The array formula in B1 is then:

=IF(ROWS($1:1)>$C$1,"",INDEX(Arry3,SMALL(IF(FREQUENCY(IF(Arry3<>"",MATCH(Arry3,Arry3,0)),Arry2),Arry2),ROWS($1:1))))

and copied down will produce the required results.

How does it work?

First let’s look at that formula in C1, which is calculating the expected number of returns. In order to understand that formula, we first need to have a look at Arry3, which forms the pivotal part of this solution. This formula itself features several important techniques, so I will attempt to explain each in detail.

Let’s look at Arry3 in full again:

=INDEX(TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),TRANSPOSE(999*(ROW(INDIRECT("1:"&MAX(Arry1)))-1)+1),999)),N(IF(1,1+INT((Arry2-1)/MAX(Arry1)))),N(IF(1,1+MOD(Arry2-1,MAX(Arry1)))))

The construction using TRIM, MID, SUBSTITUTE and REPT is a standard (and extremely useful) one, which, given a string which is delimited by some character (which could be either spaces, commas, semi-colons, etc.), can be used to either extract any one of those delimited substrings, or, as here, to generate an array of all of those delimited substrings to be manipulated as we wish.

Let’s apply the formula to just the first cell in Range1 for now, just to see how it works. I’m also going to wrap the construction in SUMPRODUCT, not because we really want to sum “Amaranth”, “Bronze” and “Silver” (!), but rather so that we can coerce the array of returns (and so can follow the process using Evaluate Formula).

The formula we will be using is:

=SUMPRODUCT(TRIM(MID(SUBSTITUTE(A1," ",REPT(" ",99)),99*(ROW(INDIRECT("1:"&1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))-1)+1,99)))

(I’m going to reduce the 999 to 99 for the purpose of this explanation. I will explain later the logic defining what choices can be made for this value.)

The key point to note about this construction is the initial substitution of all spaces within the string with, well, basically, much larger spaces. The same would be the case if the string in question were instead comma- or semi-colon-delimited: we would replace all the commas or semi-colons with these large strings of spaces.

Here, the part which generates those large strings of spaces is simply REPT(” “,99), which constitutes a string containing 99 single spaces.

Looking at the formula after those substitutions, we have:

=SUMPRODUCT(TRIM(MID("Amaranth                                 Bronze                                 Silver",99*(ROW(INDIRECT("1:"&1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))-1)+1,99)))

i.e. lots of large spacing between our words.

Let’s look at the part now which defined the start_num for the MID function, that is:

99*(ROW(INDIRECT("1:"&1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))-1)+1

The length of the string in A1 is 22, and the length of the string in A1 after substituting the single spaces with nothing is 20. Hence, this part:

1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))

is equal to 1+22-20, which is 3.

You might want to remember this construction, which effectively calculates the number of (space-separated) substrings within our string (there are three “words” in A1, if you like).

Hence, this part:

99*(ROW(INDIRECT("1:"&1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))-1)+1

resolves to:

99*(ROW(INDIRECT("1:"&3))-1)+1

which of course is:

99*({0;1;2})+1

and which finally then resolves to the array {1;100;199}. We see therefore that we will be passing an array of 3 values to the MID function for its start_num parameter, not just 1.

So what’s the significance of these 3 values in relation to the string we created with those large space substitutions? And why are we using 99 for MID’s third parameter, the number of characters?

The first in that array of 3 values – {1;100;199} – is obvious enough: we start at position 1. What about 100?

Well, if you think about it, and perhaps look back up this post to the formula which resulted in all those extra spaces being inserted in the string, what exactly will the 100th character in that string be? A bit of logical reasoning tells you that, since we split the words in the original string with gaps each equal to 99 single spaces, the 100th character must lie somewhere within that first “blank zone”.

The same applies to the character at position 199: that position must be somewhere in between the second “blank zone” within that newly-created string. (If you’re not sure about this, have a think for a second or two!)

I should point out a caveat here, which some of you may already have spotted, and that is that, if there happened to be any “words” in the string with a length of more than 99 characters, then this set-up would not be guaranteed to work, since now we would have no certitude that the 100th and 199th characters within that string would reside within our “blank zones”.

And this goes back to my point earlier relating to restrictions on this choice, and why I preferred 999 in the actual formula in this post. Of course, although the probability of a string containing words of more than 999 characters in length is so small in practice as to make this choice of value virtually fail-safe, it’s nevertheless not impossible that a word of such length is present.

The sensible, guaranteed-to-work approach is to use a value which we know will always be greater in length than any of the individual words within the string. And one such choice is the length of the string in question, since clearly no individual substring can be greater in length than the entire string itself. Hence, instead of 99, or even 999, I would recommend the use of LEN(A1) in this type of construction.

The reason I have not used that approach is related to the fact that we are applying the formulas here to a range of cells (A1:A10), not just one (A1). The explanation is a little complex and I will not touch upon it here for the time being.

I hope also that it has by now become clear to the reader why we use 99 for MID’s third parameter (i.e. the number of characters) as well. This way we guarantee that, not only do each of our returns from this function begin in one of the “blank zones”, but they extend beyond the required extraction and end in the next “blank zone”, thus ensuring that all of the desired extraction is contained within that range of 99 characters (albeit surrounded by large amounts of spacing on either side). Of course, in practice we should also use LEN(A1) here to be rigorous.

Going back to our formula then, we see that:

=SUMPRODUCT(TRIM(MID("Amaranth                                 Bronze                                 Silver",99*(ROW(INDIRECT("1:"&1+LEN(A1)-LEN(SUBSTITUTE(A1," ",""))))-1)+1,99)))

becomes:

=SUMPRODUCT(TRIM(MID("Amaranth                                 Bronze                                 Silver",{1;100;199},99)))

which is:

=SUMPRODUCT(TRIM({"Amaranth                         ";"        Bronze                   ";"              Silver"}))

and the reason for the TRIM becomes clear, since this resolves to:

=SUMPRODUCT({"Amaranth";"Bronze";"Silver"})

as required (apart from the SUMPRODUCT of course, though as I said that was added merely to coerce the array of returns from this expression).

Let’s now go back to Arry3, then. Also, note that from now on I will not be displaying these constructions in an array-processing function – such as SUMPRODUCT – as before. However, I strongly recommend that readers do just that so that they can follow the steps using the Evaluate Formula tool. I personally always wrap my formulas in SUM and array-enter it when I want to coerce an array of returns. Even if the entries in question are text, it’s not important that the formula errors, but that we are able to coerce the necessary array.

=INDEX(TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),TRANSPOSE(999*(ROW(INDIRECT("1:"&MAX(Arry1)))-1)+1),999)),N(IF(1,1+INT((Arry2-1)/MAX(Arry1)))),N(IF(1,1+MOD(Arry2-1,MAX(Arry1)))))

Bear in mind that we are not passing a single cell to this TRIM(MID(SUBSTITUTE set-up this time, but a range of cells. Let’s first look at the construction which creates the array of starting positions for the MID function here, which comes from:

TRANSPOSE(999*(ROW(INDIRECT("1:"&MAX(Arry1)))-1)+1)

To break that down we need to first look at Arry1, which was defined as:

1+LEN(Range1)-LEN(SUBSTITUTE(Range1," ",""))

We saw this construction before when we were looking at A1 only. The difference here is that we are now passing not just one cell but several to this construction. We can expand it out by putting the actual cell contents in:

1+LEN({"Amaranth Bronze Silver";"Bronze";"";"Violet Bronze Amaranth";"Red";"Puce Bronze";"Taupe Ochre Bronze Cerise";"Silver Red Cerise Orange";"";"Cerise"})-LEN(SUBSTITUTE({"Amaranth Bronze Silver";"Bronze";"";"Violet Bronze Amaranth";"Red";"Puce Bronze";"Taupe Ochre Bronze Cerise";"Silver Red Cerise Orange";"";"Cerise"}," ",""))

which will resolve to:

1+{22;6;0;22;3;11;25;24;0;6}-{20;6;0;20;3;10;22;21;0;6}

i.e.:

{3;1;1;3;1;2;4;4;1;1}

which is of course the number of individual words in each of the cells in Range1, apart from for those cells which are empty in that range (rows 3 and 9) for which this construction incorrectly returns a value of 1. However, since the only purpose of constructing this array in this case is to then take its maximum value, we are not so concerned with these inconsistent results. In other situations we may well be so concerned, and the reader is invited to investigate means by which this array can be correctly resolved so that any empty cells within the range return 0, not 1.

Hence our construction which creates the array of starting positions for the MID function now looks like:

TRANSPOSE(999*(ROW(INDIRECT("1:"&MAX({3;1;1;3;1;2;4;4;1;1})))-1)+1)

which resolves to:

TRANSPOSE({1;1000;1999;2998})

which naturally is:

{1,1000,1999,2998}

If readers aren’t sure whether an array needs to be transposed or not, here is a good example. First, let’s see what would happen if we didn’t transpose this array, so instead of looking at the correct:

TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),{1,1000,1999,2998},999))

we were actually calculating:

TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),{1;1000;1999;2998},999))

then this would resolve to:

TRIM(MID({"Amaranth                                 Bronze                                 Silver";"Bronze";"";"Violet                                 Bronze                                 Amaranth";"Red";"Puce                                 Bronze";"Taupe                                 Ochre                                 Bronze                                 Cerise";"Silver                                 Red                                 Cerise                                 Orange";"";"Cerise"},{1;1000;1999;2998},999))

Now, the important thing to note here (and to get used to if you aren’t already) is that our two arrays – that consisting of the space-separated entries as the first argument for our MID function and our four start_num parameters that we have just generated – are here both single-column arrays.

That the first array is such is obvious: the range A1:A10 is just precisely that, and since the only action we have thus far carried out with regards to the that range is to replace all the spaces in the entries with larger spaces, we can rest assured that we will not in any way have tampered with those dimensions.

The second – the array {1;1000;1999;2998} – does not actually appear in the worksheet anywhere, though if we recall that Excel represents row- and column-separators with semi-colons and commas respectively then we can see that this array also represents a single-column vector, a 4-row-by-1-column array, if you like.

But neither should we think that this array being a single-column vector, and not a single-row vector, is any sort of random occurrence. If you remember, the function which generated this array was of the form ROW(INDIRECT(…, and so it is not surprising that an array generated by the ROW function should consist of multiple row entries, rather than multiple column entries.

Once we’ve understood this distinction, we then need to see what this means in terms of how Excel operates with respect to both, that is: how is a function (such as the MID here) resolved when both arrays (for example the 1st and 2nd parameters, as here) are of the same displacement – either both are single-row arrays or both are single-column arrays – and how is it resolved when the two arrays are orthogonal – one is a single-row array, the other a single-column array?

In the first case, if both parameters are of the same vector-type, then Excel attempts to “pair up” corresponding elements within each of those arrays in order to resolve the function. If we follow the resolution of the “incorrect” version above using Evaluate Formula, we might be a bit perplexed on seeing that:

TRIM(MID({"Amaranth                                 Bronze                                 Silver";"Bronze";"";"Violet                                 Bronze                                 Amaranth";"Red";"Puce                                 Bronze";"Taupe                                 Ochre                                 Bronze                                 Cerise";"Silver                                 Red                                 Cerise                                 Orange";"";"Cerise"},{1;1000;1999;2998},999))

resolves to:

TRIM({"Amaranth                         ";"";"";"";#N/A;#N/A;#N/A;#N/A;#N/A;#N/A})

But as soon as we begin to appreciate how Excel operates with regards to the dimensions of the array(s) passed to it, then this actually makes perfect sense.

As I said, since here both parameters are of the same vector-type (single-column arrays), Excel calculates corresponding pairs of elements in turn, so that, effectively, the first element from the array of start_nums – {1;1000;1999;2998} – , i.e. 1, will apply to the first element in the space-separated array above, i.e.:

"Amaranth                                 Bronze                                 Silver"

The second element (1000) to the second element in that array (“Bronze” – remember, no spacing in that one after our substitutions; there were no spaces in the original cell contents).

And so on and so on.

Each will of course receive the same num_chars parameter (there is only one) of 999.

What’s more, when Excel gets to the fifth member of the array, it has a problem, inasmuch as we now have no corresponding fifth start_num parameter to “pair up” with “Red”. In this case, rather than simply disallowing the whole formula construct, Excel rather leniently decides to artificially increase the smaller of the two arrays (i.e. that containing the four start_num parameters) so that it is of an equal dimension to the first array (i.e. a 10-row-by-1-column array).

Of course, it chooses to fill these missing elements with #N/As (what else should it do?), but it could be argued that this is preferable to rejecting the entire formula out of hand. Indeed, there may well be situations where we knowingly accept that one of our arrays is being “re-dimensioned”, even filled with errors, providing that we then manipulate the resulting array as befits our needs.

Now let’s take a look at what happens if we first apply a transposition to the array of four start_num parameters, which will put our formula construction in the category of having one single-column vector and one single-row vector, i.e. the two are orthogonal.

This time:

TRIM(MID(SUBSTITUTE(Range1," ",REPT(" ",999)),{1,1000,1999,2998},999))

(note the commas in the array of start_nums now) will resolve, correctly in this case, to:

TRIM({"Amaranth                         ","        Bronze                   ","              Silver","";"Bronze","","","";"","","","";"Violet                           ","      Bronze                     ","            Amaranth","";"Red","","","";"Puce                             ","    Bronze","","";"Taupe                            ","     Ochre                       ","          Bronze                 ","                Cerise";"Silver                           ","      Red                        ","         Cerise                  ","               Orange";"","","","";"Cerise","","",""})

How did we achieve this – much larger – array of returns simply by transposing one of the vectors?

The first thing to ask ourselves is: just what sort of array have we returned? What are its dimensions and how does that compare with the previous version in which the two arrays were both single-column vectors?

Looking at the above, and remembering the convention regarding semi-colons and commas, we see that the resulting array is actually one consisting of 10 rows and 4 columns, i.e. 40 elements in total.

In fact, it is no mere coincidence that we have an array measuring 10 by 4: unlike with the previous attempt we have now successfully managed to apply all four of the start_num parameters to each of the 10 space-separated strings which we passed to the MID function.

What has happened is that, by first transposing one of the vectors, we have effectively instructed Excel to compute the MID function over a “two-dimensional” array, equivalent to if we were to lay it out on paper as such (I’ll leave out the TRIM function here):

List of unique entries from column of space-separated strings 4

And so you can see that we have achieved our desired output, which becomes the more evident after we now pass these 40 elements to TRIM:

{"Amaranth","Bronze","Silver","";"Bronze","","","";"","","","";"Violet","Bronze","Amaranth","";"Red","","","";"Puce","Bronze","","";"Taupe","Ochre","Bronze","Cerise";"Silver","Red","Cerise","Orange";"","","","";"Cerise","","",""}

Having managed to create an array of all individual substrings (or words) from the original range, we can now begin to think about manipulating the entries within this array to suit our needs. And here’s where I’d like to just step out of the formula breakdowns for a while to think about just what exactly it is that we’re going to do with this array now that we’ve gone to all the effort to generate it.

Well, since the whole premise of this post rests on the idea of creating a unique list from that array, we have to recall which techniques are (generally) available to us for such a task.

Readers may be familiar with either, or both, of the standard constructions for generating a unique list given either a single-row or single-column array of text entries, i.e.:

Hungarian Towns

The value in B2 calculates the number of expected returns, and is obtained via:

=SUMPRODUCT((A2:A10<>"")/(COUNTIF(A2:A10,A2:A10&"")))

We can then use either the “FREQUENCY” approach, viz this array formula in D2:

=IF(ROWS($1:1)>$B$2,"",INDEX($A$2:$A$10,SMALL(IF(FREQUENCY(IF($A$2:$A$10<>"",MATCH($A$2:$A$10,$A$2:$A$10,0)),ROW($A$2:$A$10)-MIN(ROW($A$2:$A$10))+1),ROW($A$2:$A$10)-MIN(ROW($A$2:$A$10))+1),ROWS($1:1))))

or the “COUNTIF” approach, viz this array formula in E2:

=IF(ROWS($1:1)>$B$2,"",INDEX($A$2:$A$10,MATCH(0,IF($A$2:$A$10<>"",COUNTIF(E$1:E1,$A$2:$A$10&"")),0)))

I personally use the former in all situations. Not only do I have an irking dislike for the idea that a formula “has to” start in row 2 or beyond, but, more practically, I believe that it offers more flexibility (I also understand that, despite its extra function calls, this set-up may actually be quicker than the COUNTIF version). And in cases where the array from which we wish to obtain the unique results is one derived from some other subfunctions within the formula (as is the case here), we in any case have no choice, since COUNTIF suffers from the drawback that the range passed to it must be a reference to an actual worksheet range.

So, it looks like FREQUENCY it is then. We now have to consider whether the array which we have generated – a 10-row-by-4-column matrix – can equally be passed as the range to this set-up. Will our functions here within this FREQUENCY construction, principally designed to work with single-column or -row ranges, operate equally on a two-dimensional array?

The answer, unfortunately, is no. Although INDEX, SMALL and FREQUENCY are perfectly capable of processing such arrays, MATCH is not: the lookup_array passed to this function must be either a single-row or single-column vector.

So we are left with the choice of either abandoning this set-up – and seeking some other means by which we can extract our unique entries from this matrix – or manipulating the current array so that it can be processed by MATCH. And this is the choice I made here, and which I will now demonstrate. The technique involved is an extremely useful one since, in essence, it allows us to take any two-dimensional array and re-dimension it into a single-row vector.

To recall, we are currently at the following stage in our dissection of Arry3:

INDEX({"Amaranth","Bronze","Silver","";"Bronze","","","";"","","","";"Violet","Bronze","Amaranth","";"Red","","","";"Puce","Bronze","","";"Taupe","Ochre","Bronze","Cerise";"Silver","Red","Cerise","Orange";"","","","";"Cerise","","",""},N(IF(1,1+INT((Arry2-1)/MAX(Arry1)))),N(IF(1,1+MOD(Arry2-1,MAX(Arry1)))))

The parts I want to look at now are the two constructions used here for the row and column parameters of INDEX, i.e.:

1+INT((Arry2-1)/MAX(Arry1))

and

1+MOD(Arry2-1,MAX(Arry1))

(I will come to the N(IF(1 part shortly.)

Both of these reference Arry2, so let’s first break down that construction, i.e.:

=ROW(INDIRECT("1:"&(MAX(Arry1)*ROWS(Range1))))

We previously calculated the maximum value in Arry1 as being 4, and the number of rows in Range1 is simply 10, so this is:

ROW(INDIRECT("1:"&40))

and so we see that Arry2 is simply a single-column array consisting of the integers from 1 to 40:

{1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40}

Substituting this and the value of MAX(Arry1) which we have already calculated into the row_num construction above means that:

1+INT((Arry2-1)/MAX(Arry1))

becomes:

1+INT(({1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40}-1)/4)

which is:

1+INT(({0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39})/4)

i.e.:

1+INT({0;0.25;0.5;0.75;1;1.25;1.5;1.75;2;2.25;2.5;2.75;3;3.25;3.5;3.75;4;4.25;4.5;4.75;5;5.25;5.5;5.75;6;6.25;6.5;6.75;7;7.25;7.5;7.75;8;8.25;8.5;8.75;9;9.25;9.5;9.75})

i.e.:

1+{0;0;0;0;1;1;1;1;2;2;2;2;3;3;3;3;4;4;4;4;5;5;5;5;6;6;6;6;7;7;7;7;8;8;8;8;9;9;9;9}

and so finally:

{1;1;1;1;2;2;2;2;3;3;3;3;4;4;4;4;5;5;5;5;6;6;6;6;7;7;7;7;8;8;8;8;9;9;9;9;10;10;10;10}

I will leave it to the reader to verify that the similar-looking expression for the col_num parameter resolves to:

{1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4}

Since these two arrays are both of the same vector displacement (i.e. they are both single-column arrays), we know that, when passed to INDEX for processing, corresponding entries within these arrays will be “paired”, so that we will be instructing INDEX to return an array of values whose row_num and col_num parameters will be, successively: 1/1, 1/2, 1/3, 1/4, 2/1, 2/2, 2/3, 2/4, 3/1, …, etc.

In fact, the whole point is that we will be indexing each and every one of the 40 elements within the array precisely once.

This operation may at first seem largely pointless. Why return an array of values consisting of all values from our array? Until of course you realise that:

INDEX({"Amaranth","Bronze","Silver","";"Bronze","","","";"","","","";"Violet","Bronze","Amaranth","";"Red","","","";"Puce","Bronze","","";"Taupe","Ochre","Bronze","Cerise";"Silver","Red","Cerise","Orange";"","","","";"Cerise","","",""},{1;1;1;1;2;2;2;2;3;3;3;3;4;4;4;4;5;5;5;5;6;6;6;6;7;7;7;7;8;8;8;8;9;9;9;9;10;10;10;10},{1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4;1;2;3;4})

will become:

{"Amaranth";"Bronze";"Silver";"";"Bronze";"";"";"";"";"";"";"";"Violet";"Bronze";"Amaranth";"";"Red";"";"";"";"Puce";"Bronze";"";"";"Taupe";"Ochre";"Bronze";"Cerise";"Silver";"Red";"Cerise";"Orange";"";"";"";"";"Cerise";"";"";""}

And if we just display this for a moment alongside the original array, prior to passing to this INDEX construction, i.e.:

{"Amaranth","Bronze","Silver","";"Bronze","","","";"","","","";"Violet","Bronze","Amaranth","";"Red","","","";"Puce","Bronze","","";"Taupe","Ochre","Bronze","Cerise";"Silver","Red","Cerise","Orange";"","","","";"Cerise","","",""}

we see that we have precisely the same entries, and number of entries, though whereas the latter comprises our 10-row-by-4-column matrix, which recall we are unable to pass to MATCH, the former constitutes a single, 40-row-by-1-column array, which as such we can happily manipulate as we wish, including passing to MATCH.

For a discussion on the use of the construction N(IF(1, necessary here to force INDEX to return an array of values, see here.

Now that we have our Arry3, we can return to the array formula in C1, which is:

=SUM((Arry3<>"")/MMULT(0+(Arry3=TRANSPOSE(Arry3)),ROW(INDIRECT("1:"&COUNTA(Arry3)))^0))

Recall that this formula is designed as a one-off to determine the number of expected returns from our array. The standard construction in these cases – i.e. determining the number of unique entries within a given range – is to use a combination of SUMPRODUCT and COUNTIF. Unfortunately, the array passed to COUNTIF must be a reference to an actual worksheet range, which is not the case here. Hence we must look to alternative set-ups.

The numerator here is straightforward: an array of Boolean TRUE/FALSE returns as to whether each entry in Arry3 is blank or not:

{TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE}

So let’s concentrate on the construction in the denominator. I will come to the first array being passed to MMULT shortly, but first let’s look at the second:

ROW(INDIRECT("1:"&COUNTA(Arry3)))^0

We know that the number of elements in Arry3 is 40, so this resolves to:

{1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;23;24;25;26;27;28;29;30;31;32;33;34;35;36;37;38;39;40}^0

and, if you didn’t already know, since any number raised to the power of 0 is 1, this will give us a single-column array whose 40 elements are all unity:

{1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1;1}

Let’s now look at the first array being passed to MMULT, i.e.

0+(Arry3=TRANSPOSE(Arry3))

It will be difficult to demonstrate the technique here using an array the size of Arry3. Although that array is a vector of length “only” 40, the transposition in this construction will effectively produce a 40×40 array of returns. To facilitate the explanation, let’s temporarily reduce Range1 to just two cells, i.e. A1:A2, in which case our Arry3 would be:

{"Amaranth";"Bronze";"Silver";"Bronze";"";""}

The first array in our MMULT would then look like:

0+({"Amaranth";"Bronze";"Silver";"Bronze";"";""}={"Amaranth","Bronze","Silver","Bronze","",""})

where you can see that the two arrays are now orthogonal (one a single-column vector, the other a single-row vector).

The six elements in each will now form a 6×6 matrix of entries consisting of Boolean TRUE/FALSE responses to the equality query, i.e.:

{TRUE,FALSE,FALSE,FALSE,FALSE,FALSE;FALSE,TRUE,FALSE,TRUE,FALSE,FALSE;FALSE,FALSE,TRUE,FALSE,FALSE,FALSE;FALSE,TRUE,FALSE,TRUE,FALSE,FALSE;FALSE,FALSE,FALSE,FALSE,TRUE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE,TRUE}

Adding the zero to convert these to numerics will make things easier to comprehend:

{1,0,0,0,0,0;0,1,0,1,0,0;0,0,1,0,0,0;0,1,0,1,0,0;0,0,0,0,1,1;0,0,0,0,1,1}

Hence, our MMULT is now:

MMULT({1,0,0,0,0,0;0,1,0,1,0,0;0,0,1,0,0,0;0,1,0,1,0,0;0,0,0,0,1,1;0,0,0,0,1,1},{1;1;1;1;1;1})

and hopefully readers will know that this resolves to the array {1;2;1;2;2;2}. The point here is that we have achieved an identical result to that we would have obtained if were able to use the standard SUMPRODUCT/COUNTIF construction, since, when we now add back the nominator (here also calculated over just A1:A2) we get:

=SUM({TRUE;TRUE;TRUE;TRUE;FALSE;FALSE}/{1;2;1;2;2;2})

which resolves to:

=SUM({1;0.5;1;0.5;0;0})

which is 3, as would be required if in fact Range1 was just A1:A2. In our case, the matrices involved are obviously a lot larger, and certainly the first array in MMULT (a 40×40 matrix) is far too large to give in full here. Nevertheless, the principle is the same, and suffice to say that the formula in C1 will resolve to:

=SUM({TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;TRUE;TRUE;TRUE;FALSE;TRUE;FALSE;FALSE;FALSE;TRUE;TRUE;FALSE;FALSE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;TRUE;FALSE;FALSE;FALSE;FALSE;TRUE;FALSE;FALSE;FALSE}/{2;5;2;21;5;21;21;21;21;21;21;21;1;5;2;21;2;21;21;21;1;5;21;21;1;1;5;3;2;2;3;1;21;21;21;21;3;21;21;21})

which is:

=SUM({0.5;0.2;0.5;0;0.2;0;0;0;0;0;0;0;1;0.2;0.5;0;0.5;0;0;0;1;0.2;0;0;1;1;0.2;0.333333333333333;0.5;0.5;0.333333333333333;1;0;0;0;0;0.333333333333333;0;0;0})

i.e. 10.

And we’re finally now in a position to put all these calculated results into the main formula in B1, i.e.:

=IF(ROWS($1:1)>$C$1,"",INDEX(Arry3,SMALL(IF(FREQUENCY(IF(Arry3<>"",MATCH(Arry3,Arry3,0)),Arry2),Arry2),ROWS($1:1))))

which, if you look closely, is nothing other than the standard FREQUENCY construction outlined above in the Hungarian cities example.

In essence, the whole problem of returning this unique list was not so much deciding on which construction to use (unless I’m mistaken there is no real alternative than the one outlined above using FREQUENCY). Rather it was the inherent complexity in first generating the necessary array of substrings from the original range of space-delimited strings, and then re-dimensioning this array so that we were able to further manipulate it in the required manner.

In fact, it may well be that this approach is the only viable one given any non-standard scenario in which we wish to generate a unique list of returns. We simply do whatever we can to coerce the original range into a single, one-dimensional array so that, in effect, we can treat it just as if it were a run-of-the-mill single-column worksheet range. And that’s pretty much all this technique boils down to!

3 comments

  1. Yes, this is a little difficult with Excel formula, simpler with VBA and easy with Power Query….but only for Excel 2010 and 2013.

    Greetings

  2. Thanks, Bill. And I’m sure you’re right. It’s most likely easier with VBA as well, as probably are most of the formula-based solutions that I (we) come up with.

    But I guess that’s exactly what I’m trying to do, and also what I enjoy doing.

    What I mean is, I’m fully aware that there are better, more efficient methods in Excel (Pivot Tables, Power Queries, VBA, etc.) and using alternative software as well.

    But personally I like to push the limits of what is possible using Excel formulas alone. If that’s not the most “efficient” method of solving the problem then I’m not so concerned. And I sincerely believe there’s a lot more that worksheet formulas can achieve than most people realise (I actually don’t think my solution above is overly-complicated, once you start to understand the concepts involved).

    And I guess that’s what I’m trying to do with this site. To share some of those (admittedly sometimes convoluted-looking) formula solutions to problems which people would perhaps otherwise imagine only possible using some other means, e.g. VBA.

    And hopefully there are a few people out there who can appreciate that philosophy, and perhaps even learn something from my esoteric ramblings. Even better if I can also learn from them, of course!

    Regards

  3. Pingback: Break down a column of multiple strings per cell into unqiue single cell strings - HTML CODE

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s