How to install and work with Pig in notebooks?


🐷📝 Basic commands to work with Pig in Notebooks

🔗Related content

You can find related posts in:

📀Google Colab

🐱‍🏍GitHub

You can connect with me on:

🧬LinkedIn

Summary 🧾

I will install Hadoop and Pig in a notebook and write Pig scripts that answer questions about a ratings dataset, such as: how many rows exist for each rating?

First I install Hadoop using the same commands I have used before, but without numbering the steps.

Install Hadoop 🐘

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

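If you need another version, you can find it at https://downloads.apache.org/hadoop/common/ and then replace it in the previous command. If the link returns a 404 error, older releases are usually moved to https://archive.apache.org/dist/hadoop/common/.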
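Since changing the version means editing the same string in two places of the URL, a small helper can build it. This is only a sketch: the function name `hadoop_tarball_url` and the constant `HADOOP_BASE` are hypothetical, and it assumes the mirror keeps the `hadoop-<version>/hadoop-<version>.tar.gz` layout seen in the command above.

```python
# Hypothetical helper: rebuilds the tarball URL used in the wget command.
HADOOP_BASE = "https://downloads.apache.org/hadoop/common"


def hadoop_tarball_url(version: str, base: str = HADOOP_BASE) -> str:
    """Return the download URL for the given Hadoop version string."""
    name = f"hadoop-{version}"
    return f"{base}/{name}/{name}.tar.gz"


print(hadoop_tarball_url("3.3.4"))
```

In a notebook cell the result can be passed to wget, for example `url = hadoop_tarball_url("3.3.4")` followed by `!wget $url`.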

Unzip and copy 🔓

I use the following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/

Set up Hadoop’s Java ☕

I use the following command:

#To find the default Java path and add export in hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
java_home_text_command = f"$ {JAVA_HOME[0]} "
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
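The same lookup can be sketched in plain Python, which may help when the cell has to run outside IPython. The function names are hypothetical, the example path is only illustrative, and unlike the shell version this sketch searches `PATH` instead of reading `/usr/bin/java` directly and returns the directory without a trailing slash.

```python
import os
import shutil
from typing import Optional


def java_home_from_binary(java_binary: str) -> str:
    """Strip the trailing bin/java from an already resolved java path."""
    return os.path.dirname(os.path.dirname(java_binary))


def detect_java_home() -> Optional[str]:
    """Resolve the java found on PATH, or return None if there is none."""
    java = shutil.which("java")
    if java is None:
        return None
    # realpath follows symlinks, similar to readlink -f in the shell version
    return java_home_from_binary(os.path.realpath(java))


# Illustrative path only; the real value depends on the machine.
print(java_home_from_binary("/usr/lib/jvm/java-11-openjdk-amd64/bin/java"))
print(detect_java_home())
```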

Set Hadoop home variables 🏡

I use the following command:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text

1st – Install Pig 🐷

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

If you need another version, you can find it at https://downloads.apache.org/pig/ and then replace it in the previous command.

2nd – Unzip 🔓

I use the following command (the archive is extracted in place, so nothing is copied this time):

!tar -xzvf pig-0.17.0.tar.gz

3rd – Set Pig home variables 🏡

I use the following command:

# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
# Hadoop 3.x keeps its configuration under etc/hadoop
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.4/etc/hadoop"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"

We can validate the installation with the command:

!pig -version
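If the version check fails, a quick way to narrow it down is to confirm that each home variable points to a real directory. The sketch below is an assumption-laden helper, not part of Pig or Hadoop: `missing_dirs` is a hypothetical name, and the demo uses a temporary directory and made-up variable names so it runs anywhere.

```python
import os
import tempfile


def missing_dirs(var_names):
    """Return the variable names that are unset or do not point to a directory."""
    return [name for name in var_names
            if not os.path.isdir(os.environ.get(name, ""))]


# Demo with a temporary directory standing in for a real install location.
demo_dir = tempfile.mkdtemp()
os.environ["DEMO_HOME"] = demo_dir        # hypothetical variable, demo only
os.environ.pop("DEMO_UNSET_HOME", None)   # make sure this one is unset
print(missing_dirs(["DEMO_HOME", "DEMO_UNSET_HOME"]))  # prints ['DEMO_UNSET_HOME']
```

In the notebook, the call would be `missing_dirs(["HADOOP_HOME", "JAVA_HOME", "PIG_HOME"])`, and an empty list means every variable resolves to a directory.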

4th – Create a folder with HDFS 🌎📂

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig

4.1 – Remove folder with HDFS ♻

You may need to remove it later. To do that, use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig

5th – Getting a dataset to analyze with Pig 💾

I use a dataset from GroupLens. You can get others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, and you can download it using:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them into the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/data_pig"

To list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k

6th – Creating a process to use Pig with Pig syntax 🐖

To create a job in Pig, you must look at the structure of the dataset to configure the job.
In this case we print the first lines of the dataset with the following command:

!head /content/data_pig/ml-100k/u.data

From the output I get the following information about the dataset:

  • First column refers to userID.
  • Second column refers to movieID.
  • Third column refers to rating.
  • Fourth column refers to timestamp.

Create the Pig script (%%writefile must be the first line of the notebook cell). The columns in u.data are separated by tabs, so the delimiter is '\t':

%%writefile id.pig
/* id.pig */

student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage('\t')
   as (userId:int, movieId:int, rating:int, timestamp:int);

student_order = ORDER student BY rating DESC;

Dump student_order;
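The four-column layout used in the schema can also be checked from Python before running Pig. This is a minimal sketch that assumes the file is tab-separated, as the MovieLens 100k README describes; the `Rating` type and `parse_rating` function are hypothetical names, and the sample line is only an example in the u.data layout.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line: str) -> Rating:
    """Split one tab-separated u.data line into typed fields."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


sample = "196\t242\t3\t881250949\n"  # sample line in the u.data layout
print(parse_rating(sample))
```

A line that does not split into exactly four fields raises a ValueError, which is a quick signal that the delimiter is not the one expected.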

7th – Running the process 🙈

Here we run the process with the following inputs:

  • Pig file program is id.pig
  • Dataset is in file:///content/data_pig/ml-100k/u.data

Running the process may take a few minutes…

You can run the script with:

!pig -x local id.pig

Here, however, we run the script and save the results to a .txt file:

!pig -x local id.pig > results.txt
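The saved output can then be used to answer the question of how many rows exist for each rating. The sketch below assumes DUMP prints one tuple per line in the form `(userId,movieId,rating,timestamp)`; the function name `count_by_rating` is hypothetical and the sample lines are made up, not real output.

```python
import re
from collections import Counter

# One DUMP-style tuple of four integers per line, e.g. (1,10,5,100)
TUPLE_RE = re.compile(r"^\((\d+),(\d+),(\d+),(\d+)\)$")


def count_by_rating(lines):
    """Count DUMP-style tuples per rating, skipping lines that are not tuples."""
    counts = Counter()
    for line in lines:
        match = TUPLE_RE.match(line.strip())
        if match:
            counts[int(match.group(3))] += 1  # third field is the rating
    return counts


# Made-up lines in the shape Pig prints for DUMP; not real output.
sample = ["(1,10,5,100)", "(2,10,5,101)", "(3,11,3,102)", "log line to skip"]
print(dict(count_by_rating(sample)))  # prints {5: 2, 3: 1}
```

With the real file, the call would be `count_by_rating(open("results.txt"))`, since a file object yields its lines one by one.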

8th – Advancing in the logic of the scripts 😎

Now we will extend the logic of the script to answer the following questions:

  • What are the oldest 5-star movies?
  • What are the worst movies?

8.1 – Find the oldest 5-star movies ⭐

%%writefile fiveStarMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;

Run the script and save the results to a .txt file:

!pig -x local fiveStarMovies.pig > fiveStarMovies.txt
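To make the data flow of this script easier to follow, here is a rough Python equivalent of its steps (group, average, filter above 4.0, join with titles, order by release date). It is a sketch on tiny made-up data, not a replacement for the Pig job: the function name and the sample movies are hypothetical, and the date format `%d-%b-%Y` is assumed to match the `dd-MMM-yyyy` pattern used in the script.

```python
from collections import defaultdict
from datetime import datetime


def oldest_high_rated(ratings, titles, threshold=4.0):
    """ratings: (movie_id, rating) pairs; titles: movie_id -> (title, 'dd-Mon-yyyy')."""
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:                 # GROUP ratings BY movieID
        by_movie[movie_id].append(rating)
    rows = []
    for movie_id, values in by_movie.items():
        avg = sum(values) / len(values)              # AVG(ratings.rating)
        if avg > threshold and movie_id in titles:   # FILTER, then inner JOIN
            title, date_text = titles[movie_id]
            released = datetime.strptime(date_text, "%d-%b-%Y")
            rows.append((released, title, avg))
    # ORDER BY release time, oldest first
    return [(title, avg) for released, title, avg in sorted(rows)]


# Hypothetical data for illustration only.
demo_ratings = [(1, 5), (1, 4), (2, 5), (3, 2)]
demo_titles = {1: ("Movie A", "01-Jan-1995"),
               2: ("Movie B", "15-Jun-1960"),
               3: ("Movie C", "01-Jan-1950")}
print(oldest_high_rated(demo_ratings, demo_titles))
```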

8.2 – Find the most-rated bad movies ⭐

%%writefile BadPopularMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
  AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRealese:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle;

groupedRating = GROUP ratings by movieID;

avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;  

badMovies = FILTER avgRatings BY avgRating < 2.0;

namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;

results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
          badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;

finalResults = ORDER results BY numRatings DESC;

DUMP finalResults;

Run the script and save the results to a .txt file:

!pig -x local BadPopularMovies.pig > BadPopularMovies.txt
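The difference from the previous script is that each movie keeps a count next to its average, and the final ordering uses that count. A compact Python sketch of the idea, using a running sum and count per movie, could look like this; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def most_rated_bad_movies(ratings, titles, threshold=2.0):
    """Return (title, avg_rating, num_ratings) for low-rated movies, most rated first."""
    totals = defaultdict(lambda: [0, 0])   # movie_id -> [sum of ratings, count]
    for movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1
    rows = [(titles[movie_id], total / count, count)
            for movie_id, (total, count) in totals.items()
            if total / count < threshold and movie_id in titles]
    # ORDER results BY numRatings DESC
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical data for illustration only.
demo_ratings = [(1, 1), (1, 2), (1, 1), (2, 1), (3, 4)]
demo_titles = {1: "Movie A", 2: "Movie B", 3: "Movie C"}
for title, avg, count in most_rated_bad_movies(demo_ratings, demo_titles):
    print(title, round(avg, 2), count)
```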

9th – Say thanks, like, and share if this has been helpful or interesting 😁🖖
