[*]DAT 560M – Big Data and Cloud Computing 2023 – Homework #4
[*]DAT 560M: Big Data and Cloud Computing
[*]Fall 2023, Mini B
[*]Homework #4
[*]INSTRUCTIONS
- This is an individual assignment. You may not discuss your approach to solving these
[*]questions with anyone other than the instructor or TA.
- Please include only your Student ID on the submission.
- The only allowed material is:
[*]a. Class notes
[*]b. Content posted on Canvas
[*]c. Use ONLY the code we practiced in class. Anything beyond that will not receive any points!
- You are not permitted to use other online resources.
- The physical submission is due by the next lab.
- There will be TA office hours. See the schedule on Canvas.
[*]ASSIGNMENT
[*]In this assignment, we are going to practice Spark on a file named loans.csv, which is located
[*]in our database. If you don’t have the file, you can get it from the dataset folder on the server:
[*]http://server-ip/dataset/loans.csv
[*]This dataset has information about loans distributed to poor and financially excluded people
[*]around the world by a company called Kiva. The dataset has a small number of columns,
[*]and we would like to analyze it with PySpark. Please answer each question and provide
[*]a screenshot.
[*]Part 1- Initialize Spark (5 pts)
[*]1- Start the PySpark engine and load the file. This homework is somewhat complex, so it is
[*]better to assign more resources. When initializing your engine, you can assign
[*]all available CPU cores on your machine to Spark so that it runs faster. To do that,
[*]simply put local[*] instead of local (look at the following screenshot). If it crashes or
[*]doesn’t work properly, you are more than welcome to go back to the normal initialization
[*]process. (2 pts)
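As a minimal sketch of the initialization described in item 1, the two helpers below build a session with the local[*] master and read the CSV. The helper names, the application name "hw4", the relative path "loans.csv", and the header/inferSchema options are assumptions made for illustration; the initialization practiced in class may look different and takes precedence.

```python
def start_spark(master="local[*]", app_name="hw4"):
    """Create (or reuse) a SparkSession.

    "local[*]" asks Spark to use every available CPU core;
    plain "local" falls back to a single core.
    """
    # Imported inside the function so the sketch can be loaded
    # even on a machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .appName(app_name)  # hypothetical application name
        .getOrCreate()
    )


def load_loans(spark, path="loans.csv"):
    """Read the CSV with a header row and let Spark infer column types."""
    return spark.read.csv(path, header=True, inferSchema=True)


# Typical use (commented out because it needs Spark and the data file):
# spark = start_spark()          # all cores
# spark = start_spark("local")   # fallback if local[*] misbehaves
# loans = load_loans(spark)
```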
[*]2- Get to know the dataset and do a preliminary examination (for example, column types,
[*]summary, …). (2 pts)
[*]3- Here, we have two identifiers for the country of the loan receiver, country and
[*]country_code, so one is enough. Please drop country_code. (1 pt)
[*]Part 2- Data analysis (50 pts)
[*]4- Find the three sectors awarded the most loans when the loan amount is larger than 1000. (5 pts)
[*]5- For the top sector you found in Q4, list the 6 most used activities. (5 pts)
[*]6- Find the number of loans given per year. For that, use the year from posted_time. You
[*]may add a new column called “year”. (5 pts)
[*]7- Using SQL syntax, list the number of loans per sector in decreasing order, considering only
[*]the top 3 countries in terms of the number of received loans. (10 pts)
[*]8- Find the top 20 countries in terms of the total loan amount they have received where the
[*]use of the loan includes the word “stock”. You may use SQL. (5 pts)
[*]9- Do a word count on the “use” column. For that, convert all text to lower case. If you can, it’s
[*]great to remove stopwords before doing the word count. It’s OK if you don’t know how to
[*]do so. (10 pts)
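The logic of item 9 (lower-case, split into words, drop stopwords, count) can be sketched in plain Python as below. This only illustrates the steps and is not a PySpark solution; the function name, the tokenizing pattern, the tiny stopword list, and the sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not part of the assignment).
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "her", "his"}


def word_count(texts, stopwords=STOPWORDS):
    """Count words across an iterable of strings, ignoring case and stopwords."""
    counts = Counter()
    for text in texts:
        if text is None:  # the "use" column may contain missing values
            continue
        # Lower-case first, then keep runs of letters (and apostrophes) as words.
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts


# Made-up sample values standing in for the "use" column.
sample = ["To buy stock for her shop", "to purchase more Stock", None]
counts = word_count(sample)
print(counts.most_common(3))
```

The same three steps correspond to the usual split, filter, and count stages of a Spark word count.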
[*]10- Group the loans into 5 categories. If the loan amount is equal to or larger than 50000, call it
[*]“super large”. If less, but larger than or equal to 10000, call it “large”. If less, but larger than or
[*]equal to 5000, call it “medium”. If less, but larger than or equal to 1000, call it “small”. If less,
[*]call it “tiny”. Then, find the number of loans given in each category per gender. For
[*]gender, only consider “male” or “female”. (10 pts)
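The thresholds in item 10 can be written out as a small function to make the boundaries explicit. This is a plain-Python sketch of the rule only; the function name is hypothetical, and in PySpark the same rule would be expressed with whatever conditional-column approach was practiced in class.

```python
def loan_category(amount):
    """Map a loan amount to one of the five size categories from item 10."""
    if amount >= 50000:
        return "super large"
    if amount >= 10000:
        return "large"
    if amount >= 5000:
        return "medium"
    if amount >= 1000:
        return "small"
    return "tiny"


# Boundary values fall into the higher category because each test uses >=.
for value in (50000, 10000, 5000, 1000, 999):
    print(value, loan_category(value))
```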
[*]Part 3- Feature engineering (10 pts)
[*]11- Let’s find how many people are involved in each loan application. To find out, look at
[*]the gender column. You can see that sometimes it holds one value and sometimes more than one.
[*]Count the values for each loan and add the count to the dataframe. (10 pts)
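One way to picture the count in item 11 is the plain-Python sketch below. It assumes that multiple genders in one cell are separated by commas (for example "female, male"); the assignment does not state the separator, so check the actual data first. The function name is hypothetical.

```python
def count_borrowers(gender_value, sep=","):
    """Return how many gender entries a single cell contains.

    Assumes entries are separated by `sep` (a comma by default);
    missing or empty cells count as 0.
    """
    if gender_value is None:
        return 0
    parts = [part.strip() for part in gender_value.split(sep)]
    return sum(1 for part in parts if part)


# Made-up sample cells standing in for the gender column.
for cell in ("female", "female, male, female", "", None):
    print(repr(cell), count_borrowers(cell))
```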
[*]Part 4- Machine learning (35 pts)
[*]12- Now let’s focus only on the Retail, Agriculture, and Food sectors and remove the rest of the
[*]rows. (10 pts)
[*]13- We would like to predict loan_amount based on sector, country, term_in_months, year, and
[*]the new attribute you added in Q11. Drop the rest of the columns. (5 pts)
[*]14- Prepare your data to do a prediction task. We are interested in predicting the loan