OT: Is there a name for this transformation?


On 2019-07-10 08:57:29 -0400, kamaraju kusumanchi wrote:
> Given a csv file with the following contents
> 
> 20180701, A
> 20180702, A, B
> 20180703, A, B, C
> 20180704, B, C
> 20180705, C
> 
> I would like to transform the underlying data into a dataframe such as
> 
>     date,     A,     B,     C
> 20180701,  True, False, False
> 20180702,  True,  True, False
> 20180703,  True,  True,  True
> 20180704, False,  True,  True
> 20180705, False, False,  True
> 
> the idea is that the first field in each line of the csv is the row
> index of the dataframe. The subsequent fields will be its column names
> and the values in the dataframe tell whether that element is present
> or not in the line.
> 
> Is there a name for this transformation?

This type of output is usually called a cross table, but I don't know
whether this specific transformation has a name (if you had only one of
A, B, and C per line it would be a kind of pivot operation).

> Any existing code/library
> that can transform data back and forth between the two formats? I can
> write one myself if there is none but trying to avoid reinventing the
> wheel if possible.

I need to produce cross tables frequently, but I never bothered to turn
the code into a library because the part that is common (maintaining two
hashes and dumping them) is so much smaller than the parts that differ
(data source and format, what information to extract, output format).

The basic idea is that you use a dict of dicts to represent your output
matrix: row keys are the first level, column keys are the second level,
and the values are whatever cell type you need. In your case the cell
type is bool, so each row could be a set of the column keys that are
present instead of a dict of bool. Often you want to keep information
about each row (e.g. order of appearance, or a count), so you'll use a
second dict for that. For output you get a list of columns in the right
order and then iterate over the first-level keys of your dict and the
list of columns to access each cell.
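
A minimal Python sketch of that idea, using only the standard library.
The function names (read_cross_table, dump_cross_table, to_sparse_lines)
and the "date" header are made up for illustration. It assumes Python
3.7+, where dicts keep insertion order, so row order comes for free and
the second dict is used here only to remember the order in which columns
first appear. The last function goes back from the table to the
original line format.

```python
import csv
import io


def read_cross_table(lines):
    """Parse 'key, col, col, ...' lines into (rows, columns)."""
    rows = {}     # first level: row key -> set of column keys present
    columns = {}  # second dict: column key -> None, keeps first-seen order
    for record in csv.reader(lines, skipinitialspace=True):
        if not record:
            continue  # ignore blank lines
        key, *present = record
        rows.setdefault(key, set()).update(present)
        for name in present:
            columns.setdefault(name, None)
    return rows, list(columns)


def dump_cross_table(rows, columns, index_name="date"):
    """Render the True/False matrix as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([index_name, *columns])
    for key, present in rows.items():
        # One cell per column: is this column key in the row's set?
        writer.writerow([key, *(name in present for name in columns)])
    return out.getvalue()


def to_sparse_lines(rows, columns):
    """Inverse direction: one 'key, col, col, ...' line per row."""
    return [
        ", ".join([key, *(name for name in columns if name in present)])
        for key, present in rows.items()
    ]


sample = """\
20180701, A
20180702, A, B
20180703, A, B, C
20180704, B, C
20180705, C
"""

rows, columns = read_cross_table(sample.splitlines())
table = dump_cross_table(rows, columns)
print(table, end="")
```

If you need counts or other cell types instead of bool, replace the set
with an inner dict and adjust the two loops accordingly.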

        hp


-- 
   _  | Peter J. Holzer    | we build much bigger, better disasters now
|_|_) |                    | because we have much more sophisticated
| |   | hjp at hjp.at         | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>