Given a two-dimensional array, potentially containing some empty cells, it is sometimes desirable to create a list of all non-blank entries from that array in a single column.
In general, it is not a major concern in which order the returns appear in this new column, and indeed the “standard” solution for this problem is the one given here, in which those returns are listed in an order which is consistent with the entries from an entire row from the original array being returned prior to moving onto those in the next row. The converse, in which entries are returned in a columns-first fashion, will be the subject of my first Advanced Formula Challenge post to follow this one.
For example, given the the 6-by-4 array on the left we may wish to create the single-column list on the right:
This can be done as follows:
Assuming that the original range (A1:D6 in the above example) is stored as the Named Range Range1, first we enter this formula in a cell somewhere, e.g. F1:
This is simply a single calculation which determines the number of non-blank entries in Range1, and which will be used in the main set of formulas so that we can copy them down an arbitrarily large number of rows without having to worry about any potential performance issues (such would be the case if we were to use a “lazy” IFERROR approach).
The array formula in G1 is then:
and copied down until we start to get blanks for the results.
How does it work?
The initial IF clause is straightforward enough: ROWS($1:1) is equal to 1 and, when copied down to further rows becomes, successively, ROWS($1:2) (=2), ROWS($1:3) (=3), ROWS($1:4) (=4), etc. (See here for a discussion on the merits of ROWS over ROW.)
Hence, we are simply checking, for each row to which the formula is copied, whether the relative row number in each case is greater than the value in F1, which if you recall is our calculation of the number of non-blank cells in Range1. If this is TRUE, then we return a blank and the large array formula is not even considered for calculation.
The reason for this approach with a single “helper” cell here in F1 is that the alternative (and often-seen) set-up which does without such a calculation and instead uses IFERROR, i.e.:
although perfectly acceptable, suffers from the drawback that, assuming the formula is copied down a large number of rows (and supposing we have a dynamic, changing source data, that is precisely what we want to be able to do the very first time we enter these formulas, rather than have to constantly re-adjust the number of rows containing them) then even for those rows which are beyond the expected number of returns the large, relatively resource-heavy array part of the formula will still be forced to calculate needlessly, the reason being that the “” will not be returned in those cells until the IFERROR has first gone through that large array part and determined whether it does indeed resolve to an error.
The method I demonstrate here, using the calculation in F1, has the benefit that, for each formula in which the IF clause returns TRUE, the formula stops, returns a blank, and doesn’t even begin to go through the heavy stuff. This is the nice thing about the IF formula – if the first clause is TRUE, the FALSE part does not get calculated.
Moving on to the main part of the formula, let’s look at this clause first:
Expanding this out, we will have an array of Boolean TRUE/FALSE responses as to whether each cell in Range1 is blank or not:
The values to be returned, the rather strange-looking construction involving 10^5 multiplied by the row number followed by adding on the column number for each of the cells in Range1, will resolve to:
The two together will thus produce the following array:
This array is then passed to the SMALL function, which has for its k parameter the construction using ROWS that we saw earlier in the initial IF clause, so we know that, in the first row in which this formula is placed, this part will be:
which is equal to 100001, and in the second row will be:
which is equal to 100002, and so on.
Now comes the clever bit. Taking one of these returns from the SMALL function at random – let’s say that from the sixth formula after copying down, which will be 200004 (as you can easily verify) – our relevant part of the formula will look like this:
Resolving the TEXT function gives:
So what’s happening here? Many of you may be familiar with the use of INDIRECT, whose role is to interpret a text string as an actual range reference, though many of you might not have known that this function actually has a second (optional) parameter which can be either TRUE or FALSE (or 1 or 0, as usual).
Setting this parameter to 0 instructs INDIRECT to interpret the text string as an R1C1-type reference, and not the customary A1-type. Although Excel has an option in its settings so that, if we choose, we can work with these types of references in all our formulas, I would imagine that the vast majority of users prefer to stick with the A1-style of referencing cells.
Without going into too much detail, cell A1 in R1C1 format is R1C1, cell A2 is R2C1, cell B1 is R1C2, etc., etc.
Another useful feature here is that Excel is capable of ignoring any leading zeroes in such references, so that, for example, passing “R0001C01” (the number of zeroes for “R” and “C” do not have to match) to INDIRECT also produces a valid reference to cell A1.
With this in mind, we can easily see that:
(two rows down, four columns across, if you like) will resolve to a reference to the contents of cell D2, i.e. “I”, as desired.
So why did we choose 10^5 as our multiplier for the row numbers? And why “R0C00000” precisely? Why not e.g. “R00000C00000”, or “R0C0”?
Well, although it’s almost unheard of to see a worksheet which contains data in row 500,000, or column XFD, for example, we should nevertheless account for that possibility. Imagine for a moment that we chose instead to multiply each row number by 10^3 and then add on the column number. We would set the format_text part of the TEXT function this time to “R0C000” (the reason for which I’ll explain next).
Now let’s say our first return happens to be in ALK1 (i.e. row 1, column 999). Then:
resolves to R1C999, which is fine, and just what we expected. However, now make the cell in question one further to the right, i.e. ALL1 (or row 1, column 1000). Then:
is not R1C1000, as we would require, but actually R2C000.
(Furthermore, attempting to pass this value to INDIRECT would cause a #REF! error, since there is no column 0 in a worksheet.)
Now replace 10^3 in the above with 10^5, and change “R0C000” to “R0C00000”, i.e.:
The result? R1C01000, which, once Excel has helpfully ignored the leading zeroes, will be equivalent to R1C1000, as required.
The problem with the first version is that:
is equal to:
and you can see that we have completely lost the ability to extract our desired 1 and 1000 here.
The point of 10^5, then, is to make sure that we multiply the row number by large enough a value such that, when we come to add on the column number, whatever that may be, we ensure that “units” stay separate and do not get added together, as in the above example.
Since the maximum column in Excel (at least currently) is XFD, i.e. 16384, providing the part generated by our multiplication of the row number and some other value is at least 100000 (=10^5), then we are safe in the knowledge that our row and column numbers will stay intact within this resultant figure, since, even in the worst-case scenario with row number equal to 1 and column number equal to 16,384 we would have:
(You might like to check again what would happen if we used 10^3 here: result: 17384, not at all what we would want.)
The reason why we can get away with just a single zero for the row part of the TEXT representation, i.e. “R0C000000” and are not required to use e.g. “R00000C000000”, is that this function here operates with a precedence for dealing with trailing zeroes first; everything else not accounted for is then placed before the “C”. The fact that there is only one zero there is irrelevant (we could use more to no ill-effect if we wanted of course, though why do so?).
For example, in the case:
five values are first allocated to fill the five trailing zeroes after “C”; what’s left is then passed to the “R”, i.e.:
I leave it to the reader to investigate further some of the ideas developed in this post. And hopefully some may also like to tackle the challenge to be laid out in part 2!