Selecting rows from an HDFStore by index using where
I have a data frame with user_ids stored as an indexed frame_table in an
HDFStore. Also in this HDF file is another table with actions the user
took. I want to grab all of the actions taken by 1% of the users. The
procedure is as follows:
# Get 1% of the user IDs
import random as rnd  # assumption: rnd is the standard-library random module
df_id = store.select('df_user_id', columns=['id'])
unique_ids = list(df_id.id.unique())
# random.sample needs a sequence and an integer sample size
one_pct_users = rnd.sample(unique_ids, int(0.01 * len(unique_ids)))
df_id = df_id[df_id.id.isin(one_pct_users)]
Now I want to go back and get all of the additional info that describes
the actions taken by these users from frame_tables that are indexed
identically to df_user_id. As per this example and this question I have
done the following:
one_pct_actions = store.select('df_actions', where=pd.Term('index',
one_pct_users.index))
This simply provides an empty data frame. In fact, if I copy and paste the
example in the previous pandas doc link I also get an empty data frame.
Did something change about Term in recent pandas? I'm on pandas 0.12.
I'm not tied to any particular solution, as long as I can get the HDFStore
indices from a lookup on the df_id table (which is fast) and then pull
those indices directly from the other frame tables.
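
One possible way to do the lookup described above is sketched here. It is not a confirmed fix for the Term call. It assumes a reasonably recent pandas with PyTables installed, that the id column was written as a data column, and that both tables hold their rows in the same order, so that row positions in df_user_id line up with rows in df_actions. The file name demo.h5, the toy data, and the variable names are made up for illustration. Instead of building a Term, it reads the id column with select_column, works out the matching row positions, and passes them to select as where.

```python
import os
import random
import tempfile

import numpy as np
import pandas as pd

# Toy data standing in for the real tables (hypothetical sizes and values).
rng = random.Random(0)
n_rows = 5000
users = pd.DataFrame({"id": [rng.randrange(500) for _ in range(n_rows)]})
actions = pd.DataFrame({"action": np.arange(n_rows)})

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
with pd.HDFStore(path, mode="w") as store:
    # 'id' must be a data column so it can be read back on its own.
    store.append("df_user_id", users, data_columns=["id"])
    store.append("df_actions", actions)

with pd.HDFStore(path, mode="r") as store:
    # Read only the id column; the result's index is the row position.
    ids = store.select_column("df_user_id", "id")

    # Sample 1% of the distinct user IDs (integer count, plain list).
    unique_ids = ids.unique().tolist()
    n_sample = max(1, int(0.01 * len(unique_ids)))
    sampled_ids = rng.sample(unique_ids, n_sample)

    # Row positions of every row that belongs to a sampled user.
    coords = ids[ids.isin(sampled_ids)].index

    # Pull exactly those rows from the identically ordered actions table.
    sampled_actions = store.select("df_actions", where=coords)

print(len(sampled_actions), "action rows for", n_sample, "sampled users")
```

Passing row positions only makes sense when the tables really are aligned row for row. If they are not, the user IDs themselves would have to be stored in df_actions and queried there.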