Introduction to Jupyter Notebooks & Data Analytics with Kaggle

Workshop given at PyLadies Dublin

Leticia Portella

February 19, 2019

Transcript

  1. Why Kaggle? Kaggle is a place where you can find a lot of datasets. It already has most of the tools you’ll need for a basic analysis installed, and it is a good place to read other people’s code and build a portfolio.
  2. Notebooks are a place where you can write code, show graphs, and document your methodologies and findings… all in a single place.
  3. Jupyter Shortcuts:
    Ctrl + Enter = Run cell
    ESC + B = New cell below
    ESC + dd = Delete cell
  4. Reading a document: If you check the first cell, it will tell you that the files are ready for you in ../input/. So we can read a file with a pandas function and the path of the file:
    df = pd.read_csv('../input/train.csv')
  5. Dataframes: Dataframes are similar to the tables you find in Excel. You have rows indicated by numbers and columns with names. You can check the first 5 rows of a data frame to see its basic structure:
    df.head()
  6. Dataframes: You can check the shape of a data frame to get an idea of how many rows and columns it has:
    df.shape
  7. Dataframes: You can check the main statistical characteristics of the numerical columns of a data frame:
    df.describe()
  8. Series: You can select a single column of the data frame to work with. A column of a data frame is called a Series and has some special properties:
    df['Age']
  9. Series: You can compare a Series against a value to see which rows satisfy a condition, for example passengers older than 10. This returns a Series of True and False values:
    df['Age'] > 10
  10. Series: Series also have functions that help you quickly plot their values. We can, for instance, check the histogram of ages:
    df['Age'].plot.hist()
  12. Series: We can count how many passengers were in each class:
    df['Pclass'].value_counts()
  13. Series: Since the result of value_counts is also a Series, we can store it in a variable and use it to plot a pie chart :)
    passengers_per_class = df['Pclass'].value_counts()
    passengers_per_class.plot.pie()
  14. Exercise: Make a bar plot of the number of people who survived and who didn’t survive (column Survived).
  15. Series: Remember that we could filter a Series? We can use that to explore our variables. Let’s see which class survived the most:
    survived = df['Survived'] > 0
    filtered_df = df[survived]
    passenger_per_class = filtered_df["Pclass"].value_counts()
    passenger_per_class.plot.pie()
  16. Series: We can create a new column (Ageclass) by applying a function (age_to_ageclass) to the Age column :)
    df["Ageclass"] = df["Age"].apply(age_to_ageclass)
  17. Exercise: Now that we have classes for age, we can check which age class survived the most, the same way we did with Pclass :)