git.net

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]# OT: Is there a name for this transformation?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

On Wed, Jul 10, 2019 at 3:08 PM Peter J. Holzer <hjp-python at hjp.at> wrote: > > On 2019-07-10 08:57:29 -0400, kamaraju kusumanchi wrote: > > Given a csv file with the following contents > > > > 20180701, A > > 20180702, A, B > > 20180703, A, B, C > > 20180704, B, C > > 20180705, C > > > > I would like to transform the underlying data into a dataframe such as > > > > date, A, B, C > > 20180701, True, False, False > > 20180702, True, True, False > > 20180703, True, True, True > > 20180704, False, True, True > > 20180705, False, False, True > > > > the idea is that the first field in each line of the csv is the row > > index of the dataframe. The subsequent fields will be its column names > > and the values in the dataframe tell whether that element is present > > or not in the line. > > > > Is there a name for this transformation? > > This type of output is usually called a cross table, but I don't know > whether this specific transformation has a name (if you had only one of > A, B, and C per line it would be a kind of pivot operation). Thanks for telling me about cross table. I found out about cross-tabulation functionality in Pandas using pandas.crosstab() which is described in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html As for my original problem, I solved it as follows: $cat data.csv 20180701, A 20180702, A, B 20180703, A, B, C 20180704, B, C 20180705, C import pandas as pd import numpy as np # expand the data into two numpy arrays such as # a = np.array(['20180701', '20180702', '20180702', '20180703', '20180703', '20180703', '20180704', '20180704', '20180705']) # b = np.array(['A', 'A', 'B', 'A', 'B', 'C', 'B', 'C', 'C']) rows = [] cols = [] with open('data.csv') as fo: for line in fo: line = line.strip() elem = line.split(',') N = len(elem) rows += elem[0:1] * (N-1) cols += elem[1:] a = np.array(rows) b = np.array(cols) df = pd.crosstab(a, b, rownames=['date']).astype('bool').reset_index() which gives print(df) col_0 date A B C 0 20180701 True False False 1 20180702 True True False 2 20180703 True True True 3 20180704 False True True 4 20180705 False False True -- Kamaraju S Kusumanchi | http://raju.shoutwiki.com/wiki/Blog

- Prev by Date:
**Counting Python threads vs C/C++ threads** - Next by Date:
**Embedding Python in C** - Previous by thread:
**OT: Is there a name for this transformation?** - Next by thread:
**load extention spatialite in sqlite on windows** - Index(es):