Pandas Basic
Version 1.0.3
Pandas basics
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline
from grader import Grader
DATA_FOLDER = '../readonly/final_project_data/'
transactions = pd.read_csv(os.path.join(DATA_FOLDER, 'sales_train.csv.gz'))
items = pd.read_csv(os.path.join(DATA_FOLDER, 'items.csv'))
item_categories = pd.read_csv(os.path.join(DATA_FOLDER, 'item_categories.csv'))
shops = pd.read_csv(os.path.join(DATA_FOLDER, 'shops.csv'))
The dataset we are going to use is taken from the competition that serves as the final project for this course. You can find the complete data description on the competition web page. To join the competition, use this link.
Grading
We will create a grader instance below and use it to collect your answers. When the `submit_tag` function is called, the grader stores your answer locally. The answers are not submitted to the platform immediately, so you can call `submit_tag` as many times as you need.
When you are ready to push your answers to the platform, fill in your credentials and run the `submit` function in the last section of the assignment.
grader = Grader()
Task
Let's start with a simple task.
- Print the shape of the loaded dataframes and use the [`df.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function to print several rows. Examine the features you are given.
# YOUR CODE GOES HERE
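As a minimal sketch of the two calls involved, the snippet below uses a small toy dataframe instead of the real files; the column names are hypothetical and chosen only for illustration. Note that `shape` is an attribute while `head` is a method.

```python
import pandas as pd

# Toy stand-in for one of the loaded dataframes; column names are hypothetical.
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 12, 13],
    "item_price": [100.0, 250.0, 100.0, 80.0, 40.0],
})

n_rows, n_cols = toy.shape   # shape is an attribute: (rows, columns)
first_rows = toy.head(3)     # head is a method; it returns 5 rows by default

print(n_rows, n_cols)
print(first_rows)
```

The same two calls can be applied to `transactions`, `items`, `item_categories` and `shops` in turn.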
Now use your `pandas` skills to answer the following questions. The first question is:
- What was the maximum total revenue among all the shops in September, 2014?
- Hereinafter revenue refers to total sales minus value of goods returned.
Hints:
- Sometimes items are returned; find such examples in the dataset.
- It is handy to split the `date` field into [`day`, `month`, `year`] components and use `df.year == 14` and `df.month == 9` in order to select the target subset of dates.
- You may work with the `date` feature as with strings, or you may first convert it to a datetime type with the `pd.to_datetime` function, but do not forget to set the correct `format` argument.
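The hints above can be sketched on toy data as follows. Everything about the data here is an assumption made for illustration: the column names, the day.month.year layout of the date strings, and the convention that a returned item appears as a row with a negative count. Check each of these against the real `transactions` dataframe before reusing the idea.

```python
import pandas as pd

# Toy data; the column names and the day.month.year date layout are assumptions.
sales = pd.DataFrame({
    "date": ["05.09.2014", "17.09.2014", "20.09.2014", "03.10.2014", "11.09.2014"],
    "shop_id": [1, 1, 2, 1, 2],
    "item_price": [100.0, 100.0, 50.0, 100.0, 50.0],
    "item_cnt_day": [2, -1, 4, 3, 1],  # assumption: a negative count marks a return
})

# Parse with an explicit format so that day and month are not swapped.
sales["date"] = pd.to_datetime(sales["date"], format="%d.%m.%Y")

# With negative counts for returns, price * count already subtracts returned goods.
sales["revenue"] = sales["item_price"] * sales["item_cnt_day"]

# Select September 2014 and total the revenue per shop.
in_sept_2014 = (sales["date"].dt.year == 2014) & (sales["date"].dt.month == 9)
revenue_per_shop = sales[in_sept_2014].groupby("shop_id")["revenue"].sum()
toy_max_revenue = revenue_per_shop.max()

print(revenue_per_shop)
print(toy_max_revenue)
```

Passing `format` explicitly avoids ambiguous parsing of dates such as `05.09.2014`, which could otherwise be read as May 9 rather than September 5.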
# YOUR CODE GOES HERE
max_revenue = # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('max_revenue', max_revenue)
Great! Let's move on and answer another question:
- What item category generated the highest revenue in summer 2014?
- Submit the `id` of the category found.
- Here we call "summer" the period from June to August.
Hints:
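- Note that for an object `x` of type `pd.Series`, `x.idxmax()` returns the index label of the maximum element (in pandas 1.0 and later, `x.argmax()` returns its integer position instead). A `pd.Series` can have a non-trivial index (not `[1, 2, 3, ... ]`).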
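To illustrate why a non-trivial index matters, here is a small sketch with made-up category ids as the index. In pandas 1.0 and later, `idxmax()` gives the index label of the maximum, while `argmax()` gives its integer position; the values and ids below are hypothetical.

```python
import pandas as pd

# Toy revenue per category; the index holds hypothetical category ids.
revenue_by_category = pd.Series([120.0, 480.0, 75.0], index=[40, 19, 55])

label = revenue_by_category.idxmax()     # index label of the maximum element
position = revenue_by_category.argmax()  # integer position (pandas >= 1.0)

print(label, position)
```

Here the maximum sits at position 1 but carries the label 19, so only the label identifies the category.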
# YOUR CODE GOES HERE
category_id_with_max_revenue = # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('category_id_with_max_revenue', category_id_with_max_revenue)
- How many items are there whose price stays constant (to the best of our knowledge) during the whole period of time?
- Let's assume that items are returned for the same price at which they were sold.
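One possible way to think about this question, sketched on toy data with hypothetical column names, is to count the distinct prices observed for each item and keep the items with exactly one. This is an illustration of the idea, not necessarily the intended solution.

```python
import pandas as pd

# Toy data; column names are hypothetical.
sales = pd.DataFrame({
    "item_id": [10, 10, 11, 11, 12],
    "item_price": [100.0, 100.0, 250.0, 199.0, 80.0],
})

# Count distinct prices per item; a constant price means exactly one distinct value.
distinct_prices = sales.groupby("item_id")["item_price"].nunique()
n_constant = int((distinct_prices == 1).sum())

print(n_constant)
```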
# YOUR CODE GOES HERE
num_items_constant_price = # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('num_items_constant_price', num_items_constant_price)
Remember, the data can sometimes be noisy.
- What was the variance of the daily number of items sold by the shop with `shop_id = 25` in December 2014? Do not count items that were sold but later returned.
- Fill the `total_num_items_sold` and `days` arrays, and plot the sequence with the code below.
- Then compute the variance. Remember, there can be differences in how you normalize variance (biased or unbiased estimate, see link). Compute the unbiased estimate (use the right value for the `ddof` argument in `pd.Series.var` or `np.var`).
- If there were no sales on a given day, do not impute the missing value with zero; just ignore that day.
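The difference between the biased and unbiased estimates comes down to the `ddof` argument, and the defaults differ between libraries: `np.var` uses `ddof=0`, while `pd.Series.var` uses `ddof=1`. A minimal sketch on made-up daily totals:

```python
import numpy as np
import pandas as pd

# Toy daily totals; days with no sales are simply absent rather than set to zero.
daily_totals = pd.Series([3.0, 5.0, 4.0, 8.0], index=[1, 2, 5, 9])

biased = np.var(daily_totals.to_numpy())            # ddof=0 by default: divides by n
unbiased = np.var(daily_totals.to_numpy(), ddof=1)  # divides by n - 1
pandas_default = daily_totals.var()                 # pandas uses ddof=1 by default

print(biased, unbiased, pandas_default)
```

With four observations the sum of squared deviations is 14, so the biased estimate is 14 / 4 and the unbiased one is 14 / 3.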
shop_id = 25
total_num_items_sold = # YOUR CODE GOES HERE
days = # YOUR CODE GOES HERE
# Plot it
plt.plot(days, total_num_items_sold)
plt.ylabel('Num items')
plt.xlabel('Day')
plt.title("Daily number of items sold for shop_id = 25")
plt.show()
total_num_items_sold_var = # PUT YOUR ANSWER IN THIS VARIABLE
grader.submit_tag('total_num_items_sold_var', total_num_items_sold_var)
Authorization & Submission
To submit the assignment to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on the programming assignment page. Note: the token expires 30 minutes after generation.
STUDENT_EMAIL = # EMAIL HERE
STUDENT_TOKEN = # TOKEN HERE
grader.status()
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)
Well done! :)