A minor error on page 258 and ch08.ipynb (Training a logistic regression model for document classification) when preparing train and test datasets #139

Open
pavlo-yanchenko opened this issue Aug 14, 2023 · 1 comment

@pavlo-yanchenko

When preparing the train and test datasets, we slice the IMDB DataFrame with the .loc method, which selects rows by index label rather than by position:

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

It's worth noting that, unlike ordinary Python slices, .loc includes both the start and the stop labels in the result (when they are present in the index). As a result, the sample with index label 25000 ends up in both the train and the test dataset.
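The overlap is easy to reproduce on a small scale. The sketch below is not the book's code: it uses a hypothetical 10-row toy DataFrame with a default integer index in place of the IMDB data, and a split label of 5 in place of 25000.

```python
import pandas as pd

# Toy stand-in for the IMDB DataFrame (assumption: default RangeIndex 0..9).
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

split = 5  # plays the role of 25000 in the original snippet

# Label-based slicing: the stop label is included.
train_idx = df.loc[:split].index
test_idx = df.loc[split:].index

# Labels that appear in both slices.
overlap = train_idx.intersection(test_idx)
print(len(train_idx), len(test_idx), list(overlap))  # 6 5 [5]
```

The train slice has one more row than intended, and that extra row is also the first row of the test slice.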

@rasbt
Owner

rasbt commented Aug 14, 2023

Great point. I think it's best to switch to .iloc here.
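A minimal sketch of what a position-based split could look like, assuming the same column names as in the issue. It is shown on a hypothetical 10-row toy DataFrame with a split at 5, and is not necessarily the exact change made in the repository. Because .iloc slices by position and excludes the stop, like ordinary Python slices, the two halves do not share a row.

```python
import pandas as pd

# Toy stand-in for the IMDB DataFrame (assumption: same column names).
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

split = 5  # plays the role of 25000

# Position-based slicing: the stop position is excluded.
X_train = df["review"].iloc[:split].values
y_train = df["sentiment"].iloc[:split].values
X_test = df["review"].iloc[split:].values
y_test = df["sentiment"].iloc[split:].values

print(len(X_train), len(X_test))  # 5 5
```

With .loc, an equivalent fix under a default integer index would be to stop the train slice at label `split - 1`, but that relies on the index labels matching row positions.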
