Unique, Alphabetical List from Several Columns 25

In this post I shall present a method for generating a unique, alphabetical list in a single column from data contained within a contiguous range comprising several columns.

For example, given the dataset below in A2:E5, we will return that list beginning in cell G1:

Unique, Alphabetical List from Several Columns

We do this as follows:

First go to Name Manager and define:

Range1 as:

=$A$2:$E$5

Arry1 as:

=ROW(INDIRECT("1:"&COLUMNS(Range1)*ROWS(Range1)))

Arry2 as:

=1+INT((Arry1-1)/COLUMNS(Range1))

Arry3 as:

=1+MOD(Arry1-1,COLUMNS(Range1))

Arry4 as:

=INDEX(Range1,N(IF(1,Arry2)),N(IF(1,Arry3)))

The formula in H1, used to determine the number of returns, is:

=SUMPRODUCT((Range1<>"")/COUNTIF(Range1,Range1&""))

and the array formula in G1 is:

=IF(ROWS($1:1)>$H$1,"",INDEX(Arry4,MATCH(SMALL(IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1),COUNTIF(Range1,"<"&Arry4)),ROWS($1:1)),IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0)))

How does it work?

The formula in H1 is reasonably straightforward, and a standard construction in such cases. We can see that:

=SUMPRODUCT((Range1<>"")/COUNTIF(Range1,Range1&""))

resolves to:

=SUMPRODUCT(({"Due","","Otto","","Otto";"","","","","Tre";"Sei","Cinque","","Quattro","Otto";"Due","","","Quattro","Otto"}<>"")/COUNTIF(Range1,Range1&""))

which is:

=SUMPRODUCT({TRUE,FALSE,TRUE,FALSE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE;TRUE,TRUE,FALSE,TRUE,TRUE;TRUE,FALSE,FALSE,TRUE,TRUE}/COUNTIF(Range1,Range1&""))

and then resolving the COUNTIF portion, which is simply counting how many times each of the entries in Range1 occurs within that range, gives:

=SUMPRODUCT({TRUE,FALSE,TRUE,FALSE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE;TRUE,TRUE,FALSE,TRUE,TRUE;TRUE,FALSE,FALSE,TRUE,TRUE}/{2,9,4,9,4;9,9,9,9,1;1,1,9,2,4;2,9,9,2,4})

which becomes, after the division:

=SUMPRODUCT({0.5,0,0.25,0,0.25;0,0,0,0,1;1,1,0,0.5,0.25;0.5,0,0,0.5,0.25})

i.e. 6, as required.

And so, by referencing this value in the main formula as follows:

=IF(ROWS($1:1)>$H$1,"",

we know that, beyond the sixth row to which we copy this formula we will return a blank, as required.

So let’s now look at the main clause of that formula then, which recall is:

INDEX(Arry4,MATCH(SMALL(IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1),COUNTIF(Range1,"<"&Arry4)),ROWS($1:1)),IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0))

This is actually a relatively standard construction for extracting unique, alphabetical values, the only (and important) difference being that our range is not a single-column, one-dimensional range, but rather a two-dimensional one.

However, in principle the technique is the same: what is first required is some “re-dimensioning” of our two-dimensional range into one of a single dimension, after which we can apply the usual constructions to obtain our desired results.

So let’s first look at how we achieve that then, which is given by Arry4, i.e.:

INDEX(Range1,N(IF(1,Arry2)),N(IF(1,Arry3)))

Here, we are simply going to INDEX each and every element in our two-dimensional range. However, we are going to do this in such a way so that the resulting array, although containing precisely the same elements as that two-dimensional range, will in fact comprise an array consisting of just one dimension, i.e. a single-column vector.

In order to deconstruct Arry4, we need to first look at Arry2 and Arry3, which contribute to the row_num and column_num of the INDEX function respectively.

Each of these references Arry1, so let’s first look at that, which is:

=ROW(INDIRECT("1:"&COLUMNS(Range1)*ROWS(Range1)))

The number of columns and rows in Range1 are 5 and 4 respectively, so this becomes:

ROW(INDIRECT("1:"&5*4))

which is simply:

{1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20}

i.e. an array of integers from 1 up to the product of the number of rows and number of columns in our original dataset.

We can now look at Arry2, which we defined as:

=1+INT((Arry1-1)/COLUMNS(Range1))

Substituing in Arry1 and also the number of columns in Range1:

1+INT(({1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20}-1)/5)

which is:

1+INT({0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19}/5)

and resolving first the division:

1+INT({0;0.2;0.4;0.6;0.8;1;1.2;1.4;1.6;1.8;2;2.2;2.4;2.6;2.8;3;3.2;3.4;3.6;3.8})

and then taking integer values gives us:

1+{0;0;0;0;0;1;1;1;1;1;2;2;2;2;2;3;3;3;3;3}

i.e.:

{1;1;1;1;1;2;2;2;2;2;3;3;3;3;3;4;4;4;4;4}

Before I explain the reason for generating this array, let’s first look at Arry3, which is in any case a very similar construction, though using MOD in place of INT, i.e.:

=1+MOD(Arry1-1,COLUMNS(Range1))

And, similarly to the resolution of Arry2, this becomes:

1+MOD({0;1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19},5)

i.e.:

1+{0;1;2;3;4;0;1;2;3;4;0;1;2;3;4;0;1;2;3;4}

which is finally:

{1;2;3;4;5;1;2;3;4;5;1;2;3;4;5;1;2;3;4;5}

If we look at Arry4 now, inserting these arrays for our INDEX parameters, we have:

INDEX(Range1,N(IF(1,{1;1;1;1;1;2;2;2;2;2;3;3;3;3;3;4;4;4;4;4})),N(IF(1,{1;2;3;4;5;1;2;3;4;5;1;2;3;4;5;1;2;3;4;5})))

Readers who have read my post here will be aware of this technique for coercing INDEX to return an array of values, and so the above becomes:

INDEX(Range1,{1;1;1;1;1;2;2;2;2;2;3;3;3;3;3;4;4;4;4;4},{1;2;3;4;5;1;2;3;4;5;1;2;3;4;5;1;2;3;4;5})

And hopefully the reason for generating these arrays for each of INDEX’s parameters now becomes clear since, if we look at pairs from these two arrays which we will be passing to INDEX, we see that the above construction is equivalent to performing each of:

INDEX(Range1,1,1)
INDEX(Range1,1,2)
INDEX(Range1,1,3)
INDEX(Range1,1,4)
INDEX(Range1,1,5)
INDEX(Range1,2,1)
INDEX(Range1,2,2)

all the way up to:

INDEX(Range1,4,5)

In effect, then, we are doing nothing other than the equivalent of INDEXing each and every element in our array precisely once.

Hence, the end result is that Arry4 will be:

{"Due";"";"Otto";"";"Otto";"";"";"";"";"Tre";"Sei";"Cinque";"";"Quattro";"Otto";"Due";"";"";"Quattro";"Otto"}

Now, if I just temporarily present this alongside the Excel interpretation of Range1, which is:

{"Due","","Otto","","Otto";"","","","","Tre";"Sei","Cinque","","Quattro","Otto";"Due","","","Quattro","Otto"}

we can see that there is no difference whatsoever in the values within these two arrays. The only thing which is different is that, whereas Range1 comprises a 4-row-by-5-column two-dimensional array, Arry4, which was derived by simply INDEXing each and every element in Range1, is actually a one-dimensional 20-row-by-1-column range.

Of course, we could equally have made it so that Arry4 was 1-row-by-20-column range, but that’s a moot point: whichever we choose, the net result is that we have achieved the “re-dimensioning” of our range from one of two dimensions to one of just one.

This means that we are now at liberty to operate on this array with all the usual techniques at our disposal which are applicable to one-dimensional ranges though which fail when extended to arrays comprising e.g. multiple columns.

If we return to our main construction:

INDEX(Arry4,MATCH(SMALL(IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1),COUNTIF(Range1,"<"&Arry4)),ROWS($1:1)),IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0))

we can now start to tackle it with no fear of our two-dimensional range being an issue.

Taking this part first:

IF(Range1<>"",MATCH(Range1,Arry4,0))

and first resolving the condition being passed to IF:

IF({TRUE,FALSE,TRUE,FALSE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE;TRUE,TRUE,FALSE,TRUE,TRUE;TRUE,FALSE,FALSE,TRUE,TRUE},MATCH(Range1,Arry4,0))

and then inserting our known arrays Range1 and Arry4 into the MATCH function:

IF({TRUE,FALSE,TRUE,FALSE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE;TRUE,TRUE,FALSE,TRUE,TRUE;TRUE,FALSE,FALSE,TRUE,TRUE},MATCH({"Due","","Otto","","Otto";"","","","","Tre";"Sei","Cinque","","Quattro","Otto";"Due","","","Quattro","Otto"},{"Due";"";"Otto";"";"Otto";"";"";"";"";"Tre";"Sei";"Cinque";"";"Quattro";"Otto";"Due";"";"";"Quattro";"Otto"},0))

Resolving this array of MATCH functions:

IF({TRUE,FALSE,TRUE,FALSE,TRUE;FALSE,FALSE,FALSE,FALSE,TRUE;TRUE,TRUE,FALSE,TRUE,TRUE;TRUE,FALSE,FALSE,TRUE,TRUE},{1,#N/A,3,#N/A,3;#N/A,#N/A,#N/A,#N/A,10;11,12,#N/A,14,3;1,#N/A,#N/A,14,3})

and then finally the IF function gives us the following array:

{1,FALSE,3,FALSE,3;FALSE,FALSE,FALSE,FALSE,10;11,12,FALSE,14,3;1,FALSE,FALSE,14,3}

This array is then passed to FREQUENCY as its first parameter, with Arry1 as its second, so that:

FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1)

becomes:

FREQUENCY({1,FALSE,3,FALSE,3;FALSE,FALSE,FALSE,FALSE,10;11,12,FALSE,14,3;1,FALSE,FALSE,14,3},Arry1)

and putting in Arry1 in full:

FREQUENCY({1,FALSE,3,FALSE,3;FALSE,FALSE,FALSE,FALSE,10;11,12,FALSE,14,3;1,FALSE,FALSE,14,3},{1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20})

and finally resolving this FREQUENCY construction produces the following array:

{2;0;4;0;0;0;0;0;0;1;1;1;0;2;0;0;0;0;0;0;0}

This is quite a standard technique we have used: the position of the non-zero values in the above array represent the first occurrence within that array of each distinct value in our range, and so gives us a means by which to return unique values only.

And so we use this array as the condition within an IF statement:

IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1),COUNTIF(Range1,"<"&Arry4))

which is now:

IF({2;0;4;0;0;0;0;0;0;1;1;1;0;2;0;0;0;0;0;0;0},COUNTIF(Range1,"<"&Arry4))

The COUNTIF, which is going to be used to determine our alphabetic sorting, can now be resolved:

IF({2;0;4;0;0;0;0;0;0;1;1;1;0;2;0;0;0;0;0;0;0},{1;0;3;0;3;0;0;0;0;10;9;0;0;7;3;1;0;0;7;3})

which becomes:

{1;FALSE;3;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;10;9;0;FALSE;7;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE}

Our construction now looks like:

INDEX(Arry4,MATCH(SMALL({1;FALSE;3;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;10;9;0;FALSE;7;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE;FALSE},ROWS($1:1)),IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0))

and, resolving the SMALL function, whose k parameter is here given by ROWS($1:1), i.e. 1:

INDEX(Arry4,MATCH(0,IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0))

The COUNTIF construction we have just dissected, so we can insert that here as well:

INDEX(Arry4,MATCH(0,IF(Arry4<>"",{1;0;3;0;3;0;0;0;0;10;9;0;0;7;3;1;0;0;7;3},0))

and resolving the IF statement:

INDEX(Arry4,MATCH(0,{1;FALSE;3;FALSE;3;FALSE;FALSE;FALSE;FALSE;10;9;0;FALSE;7;3;1;FALSE;FALSE;7;3},0))

which is clearly:

INDEX(Arry4,12)

And if we put in the values from Arry4:

INDEX({"Due";"";"Otto";"";"Otto";"";"";"";"";"Tre";"Sei";"Cinque";"";"Quattro";"Otto";"Due";"";"";"Quattro";"Otto"},12)

which is “Cinque”, as required.

25 comments

  1. In Google Sheets, you could perform this task with

    =ArrayFormula(sort(unique(transpose(split(concatenate(A2:E5&char(9)),char(9))))))

    You add char(9) to the end of all the cells, then bring them together with concatenate and split them to create one row of values. Then you transpose, remove duplicates with the unique function, and sort it. If you wanted to limit to 6, since this is text you could use

    =ArrayFormula(query(sort(unique(transpose(split(concatenate(A2:E5&char(9)),char(9))))),"limit "&H1&""))
  2. Thanks again, Isai!

    And yet again, you leave me thinking that I must be missing out on something by not having made the transition/extension to Google Sheets – everything looks so effortless!

    Even if I can’t fully appreciate the solutions you post here, I sincerely hope that there will be others who visit this site who can.

    Many thanks and regards

  3. Hello

    If I translate this part …
    MATCH(Range1,Arry4,0)
    … into German and resolve it the result is:
    {1.#NV.3.#NV.3;#NV.#NV.#NV.#NV.10;11.12.#NV.14.3;1.#NV.#NV.14.3}

    If I replace Arry4 with the array of values …
    {“Due”;0;”Otto”;0;”Otto”;0;0;0;0;”Tre”;”Sei”;”Cinque”;0;”Quattro”;”Otto”;”Due”;0;0;”Quattro”;”Otto”}
    … then it resolves to:
    {1.2.3.2.3;2.2.2.2.10;11.12.2.14.3;1.2.2.14.3}

    . is for column
    ; is for row

  4. Hi. I’m not quite sure what you mean here. Can you explain how, when you “resolved”:

    MATCH(Range1,Arry4,0)

    you arrived at an array of:

    {1.#NV.3.#NV.3;#NV.#NV.#NV.#NV.10;11.12.#NV.14.3;1.#NV.#NV.14.3}

    I think the key question here is: exactly how did you resolve it?

    Regards

  5. Can you post the “resolution” of the formula:

    MATCH(Range1,Arry4,0)

    when you first resolve, using F9 within the formula bar, just the parts Range1 and Arry4?

    Doing this, I get:

    MATCH({"Due","","Otto","","Otto";"","","","","Tre";"Sei","Cinque","","Quattro","Otto";"Due","","","Quattro","Otto"},{"Due";"";"Otto";"";"Otto";"";"";"";"";"Tre";"Sei";"Cinque";"";"Quattro";"Otto";"Due";"";"";"Quattro";"Otto"},0)

    as I would expect. Further resolution (again using F9) gives:

    {1,2,3,2,3;2,2,2,2,10;11,12,2,14,3;1,2,2,14,3}

    Regards

  6. If I resolve both Range1 and Arry4 I get:
    =VERGLEICH({“Due”.0.”Otto”.0.”Otto”;0.0.0.0.”Tre”;”Sei”.”Cinque”.0.”Quattro”.”Otto”;”Due”.0.0.”Quattro”.”Otto”};{“Due”;0;”Otto”;0;”Otto”;0;0;0;0;”Tre”;”Sei”;”Cinque”;0;”Quattro”;”Otto”;”Due”;0;0;”Quattro”;”Otto”};0)

    . is for column
    ; is for row

  7. My problem is that the formula gives a different result for cell G1.
    So I tried to locate the problem.
    You are right. Where you have a “” I get a 0. And it’s doesn’t matter whether I use F9 or Evaluate Formula tool.
    When I replace the blank cells in Range1 with =”” my results are the same as yours.

  8. Sincere apologies. You were right. I was still using “” for the blanks for testing, which of course is a completely different matter.

    The correct formula should be:

    =IF(ROWS($1:1)>$H$1,"",INDEX(Arry4,MATCH(SMALL(IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Arry4,0)),Arry1),COUNTIF(Range1,"<"&Arry4)),ROWS($1:1)),IF(Arry4<>"",COUNTIF(Range1,"<"&Arry4)),0)))

    I will have to amend the post and analysis to reflect this important difference.

    And I can’t thank you enough for having brought it too my attention. And apologies once again for the confusion.

    Regards

  9. Hi XOR LX,

    Thanks for welcoming me here. Google Sheets is cool depending on what you wanna do. Excel is much better when it comes to print options, format options, pivot table options, and usually when you want more options haha. However, Google Sheets does have a lot of useful functions that Excel does not have, which may make some type of formulas a little shorter and perhaps a little easier to make after some practice. However, I know of a couple useful things you can’t do in Google Sheets, like 3D-references or combining the subtotal function with other functions to do conditional subtotaling (I am sure there is more since I do not know Excel very well).

  10. Hello XOR LX,

    Just found out about this very very interesting example … Really Great …!!!

    Is there a tweak which could allow to produce the same list … but sorted by descending Frequency ?

    Cheers

    James

  11. Hi XOR LX,

    The same formula works for a list of values in one column as well. Do you have any simpler formula to list down the unique values from one column or one row?

    Regards,

    Prasad DN

  12. =IF(ROWS($1:1)>$E$2,"",INDEX(Range1,MATCH(SMALL(IF(FREQUENCY(IF(Range1<>"",MATCH(Range1,Range1,0)),ROW(INDIRECT("1:"&ROWS(Range1)))),COUNTIF(Range1,"<"&Range1)),ROWS($1:1)),IF(Range1<>"",COUNTIF(Range1,"<"&Range1)),0)))

    Is the best I could get by replacing the defined names

  13. @ Prasad DN

    Sure!

    Using Range1 for your vertical source range (the formula would be different if instead your range was a horizontal one), the formula in H1 for the expected number of returns would this time be (CSE):

    =SUM(IF(FREQUENCY(IF(LEN(Range1),MATCH(Range1,Range1,0)),ROW(Range1)-MIN(ROW(Range1))+1),1))

    And the main formula in G1 (again, CSE):

    =IF(ROWS($1:1)>$H$1,"",INDEX(Range1,MATCH(SMALL(IF(FREQUENCY(IF(LEN(Range1),MATCH(Range1,Range1,0)),ROW(Range1)-MIN(ROW(Range1))+1),COUNTIF(Range1,"<="&Range1)),ROWS($1:1)),COUNTIF(Range1,"<="&Range1),0)))

    Copy down as required.

    Hope that helps!

    Cheers

  14. @Prasad DN

    Just seen your posted formula. Excellent construction! Well done! I’m very impressed.

    I would mention two small points. Firstly, it’s probably best to avoid the volatile INDIRECT set-up you use, as it’s not actually necessary here. See my construction, for example, for generating the bins_array for FREQUENCY.

    Secondly, you don’t actually need to repeat the conditional IF statement for MATCH’s lookup_array here (though you usually do, so I don’t blame you for including it again!).

    Great stuff! Thanks.

  15. Hi Again,

    I figured out one more method if data is single column:

    =INDEX(Range1,SUM(COUNTIF(Range1,$H$10:H10))+1)

    Note: H10 will have first data entered manually.

    and if data is single row, same formula with index(), skipping row_num:

    =INDEX(Range1,,SUM(COUNTIF(Range1,$H$10:H10))+1)

    Note: H10 will have first data entered manually.

    Regards,
    Prasad DN

  16. @Prasad DN

    I don’t understand your INDEX construction. How is that supposed to return a unique, alphabetical list?

    Can you give some examples of data in Range1 for which that works?

    Regards

  17. @Prasad DN

    So that solution requires both that the original list be sorted and that you manually enter the first value?

    Forgive me, but that does not sound like the most dynamic or flexible of solutions!

    Regards

  18. @Prasad DN

    Not sure I understand. If I use FREQENCY on a range of text values (as they are in this post), then of course I don’t get the desired results.

    Can you clarify? How does this new formula return the list starting in G1?

    Regards

  19. Pingback: Using the Excel TEXTJOIN Function To Return Unique Items In A One-Cell Delimited String From A 2D and 3D Range By David Hager | Excel For You

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s