How to install and work with Pig in Notebooks?

Christina Kasper

January 27, 2023

🐷📝 Basic commands to work with Pig in Notebooks

🔗Related content

📀Google Colab

🐱‍🏍GitHub

You can connect with me on:

🧬LinkedIn

Resume 🧾

I will install Hadoop together with Pig and write Pig jobs that answer questions about a movie ratings dataset.

First I install Hadoop with the same commands I have used before, this time without numbering the steps.

Install Hadoop 🐘

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/hadoop/common/hadoop-3.3.4/hadoop-3.3.4.tar.gz

If you need another version, you can find it at https://downloads.apache.org/hadoop/common/ and substitute it in the previous command.

Unzip and copy 🔓

I use the following command:

!tar -xzvf hadoop-3.3.4.tar.gz && cp -r hadoop-3.3.4/ /usr/local/

Set up Hadoop’s Java ☕

I use the following commands:

#To find the default Java path and add export in hadoop-env.sh
JAVA_HOME = !readlink -f /usr/bin/java | sed "s:bin/java::"
java_home_text = JAVA_HOME[0]
java_home_text_command = f"$ {JAVA_HOME[0]} "
!echo export JAVA_HOME=$java_home_text >>/usr/local/hadoop-3.3.4/etc/hadoop/hadoop-env.sh
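
The same lookup can be done in plain Python, which may be easier to debug than the shell pipeline. This is only a sketch: `java_home_from_binary` is a hypothetical helper name, and it assumes the resolved binary path ends in `bin/java`, as it does on a typical notebook image.

```python
import os


def java_home_from_binary(java_binary_path):
    """Return the Java home for an already resolved java binary path.

    Mirrors the sed step above for typical paths: it drops the trailing
    "bin/java" and keeps the trailing slash.
    """
    suffix = "bin/java"
    if not java_binary_path.endswith(suffix):
        raise ValueError(f"unexpected java path: {java_binary_path}")
    return java_binary_path[: -len(suffix)]


# os.path.realpath follows symlinks, much like `readlink -f`.
resolved = os.path.realpath("/usr/bin/java")
if resolved.endswith("bin/java"):
    print(java_home_from_binary(resolved))
```

If the path does not have the expected shape, the helper raises instead of silently writing a wrong JAVA_HOME.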

Set Hadoop home variables 🏡

I use the following commands:

# Set environment variables
import os
os.environ['HADOOP_HOME']="/usr/local/hadoop-3.3.4"
os.environ['JAVA_HOME']=java_home_text

1st – Install Pig 🐷

I use the following command, but you can change it to get the latest version:

!wget https://downloads.apache.org/pig/pig-0.17.0/pig-0.17.0.tar.gz

If you need another version, you can find it at https://downloads.apache.org/pig/ and substitute it in the previous command.

2nd – Unzip and copy 🔓

I use the following command:

!tar -xzvf pig-0.17.0.tar.gz

3rd – Set Pig home variables 🏡

I use the following commands:

# Set environment variables
import os
os.environ['PIG_HOME']="/content/pig-0.17.0"
# Point Pig at the configuration directory of the Hadoop version installed above
os.environ['PIG_CLASSPATH']="/usr/local/hadoop-3.3.4/etc/hadoop"
os.environ["PATH"] += os.pathsep + "/content/pig-0.17.0/bin"

We can validate the installation with this command:

!pig -version
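
If that command reports that `pig` is not found, the PATH entry is the usual suspect. As a rough check, assuming nothing beyond the Python standard library, something like the following lists which tools cannot be resolved; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools, search_path=None):
    """Return the names in `tools` that cannot be found on the search path.

    When search_path is None, shutil.which falls back to the PATH
    environment variable of the current process.
    """
    return [tool for tool in tools if shutil.which(tool, path=search_path) is None]


# An empty list means every tool was found.
print(missing_tools(["pig"]))
```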

4th – Create a folder with HDFS 🌎📂

I use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -mkdir file:///content/data_pig

4.1 – Remove folder with HDFS ♻

You may need to remove it later. To do that, use the following command:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -rm -r file:///content/data_pig

5th – Getting a dataset to analyze with Pig 💾

I use a dataset from GroupLens. You can find others at:
http://files.grouplens.org/datasets/

This time I use MovieLens, which you can download with:

!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

To use the data, extract the files. I extract them into the path given after -d in the command:

!unzip "/content/ml-100k.zip" -d "/content/data_pig"
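
If `unzip` is not available in your notebook, the extraction can also be sketched with Python's standard `zipfile` module. The helper name `extract_archive` is my own, and the commented paths are the ones used in this post.

```python
import zipfile


def extract_archive(zip_path, target_dir):
    """Extract every member of zip_path into target_dir and return the member names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


# In the notebook:
# names = extract_archive("/content/ml-100k.zip", "/content/data_pig")
# print(names[:5])
```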

To list them:

!/usr/local/hadoop-3.3.4/bin/hadoop fs -ls /content/data_pig/ml-100k

6th – Creating process to use Pig with Pig Syntax 🐖

To create a job in Pig, you first need to look at the structure of the dataset so you can define the schema.
In this case we print the first rows of the dataset with the following command:

!head /content/data_pig/ml-100k/u.data

From the output I can see the structure of the dataset:

The first column is the userID.
The second column is the movieID.
The third column is the rating.
The fourth column is the timestamp.
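
To make the column layout concrete, here is a small Python sketch that parses one line of u.data, which is tab-separated. The `Rating` and `parse_rating` names are hypothetical, and the sample line is only an illustration of the layout.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_id: int
    rating: int
    timestamp: int


def parse_rating(line):
    """Split one tab-separated u.data line into its four integer columns."""
    user_id, movie_id, rating, timestamp = line.rstrip("\n").split("\t")
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# Illustrative line in the u.data layout.
row = parse_rating("196\t242\t3\t881250949\n")
print(row.rating)  # → 3
```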

Create the Pig script (the %%writefile magic must be the first line of the cell):

%%writefile id.pig
/* id.pig */

student = LOAD 'file:///content/data_pig/ml-100k/u.data' USING PigStorage('\t')
   as (userId:int, movieId:int, rating:int, timestamp:int);

student_order = ORDER student BY rating DESC;

Dump student_order;

7th – Running the process 🙈

Here we run the process with the following inputs:

The Pig program file is id.pig.
The dataset is in file:///content/data_pig/ml-100k/u.data.

The process may take a few minutes to run.

You can run the script in local mode with:

!pig -x local id.pig

Here, though, we run the script and save the results in a .txt file:

!pig -x local id.pig > results.txt
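
Once the output is saved, it can be post-processed in Python. The sketch below counts how many rows each rating has. It assumes every DUMP line looks like (userId,movieId,rating,timestamp) with no empty fields; `count_by_rating` is a hypothetical helper and the sample lines are made up.

```python
from collections import Counter


def count_by_rating(lines):
    """Count rows per rating from Pig DUMP lines shaped like (userId,movieId,rating,timestamp)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        # Skip anything that is not a tuple line (blank lines, stray log output).
        if not (line.startswith("(") and line.endswith(")")):
            continue
        fields = line[1:-1].split(",")
        if len(fields) != 4:
            continue
        counts[int(fields[2])] += 1
    return counts


# Made-up sample lines, not real output.
sample = ["(196,242,5,881250949)", "(186,302,3,891717742)", "(22,377,5,878887116)", ""]
print(sorted(count_by_rating(sample).items()))  # → [(3, 1), (5, 2)]
```

In the notebook you could pass the open file directly, for example `count_by_rating(open("results.txt"))`.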

8th – Advancing in the logic of the scripts 😎

Now we will extend the logic of the scripts to answer the following questions:

What are the oldest 5 star movies?
What are the worst movies?

8.1 – Find the oldest 5-star movies ⭐

%%writefile fiveStarMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
    AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle, ToUnixTime(ToDate(releaseDate, 'dd-MMM-yyyy')) AS releaseTime;

ratingsByMovie = GROUP ratings BY movieID;

avgRatings = FOREACH ratingsByMovie GENERATE group as movieID, AVG(ratings.rating) as avgRating;

fiveStarMovies = FILTER avgRatings BY avgRating > 4.0;

fiveStarsWithData = JOIN fiveStarMovies BY movieID, nameLookup BY movieID;

oldestFiveStarMovies = ORDER fiveStarsWithData BY nameLookup::releaseTime;

DUMP oldestFiveStarMovies;

Run the script and save the results in a .txt file:

!pig -x local fiveStarMovies.pig > fiveStarMovies.txt
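
To make the steps of fiveStarMovies.pig easier to follow, here is a rough Python mirror of the same logic on a tiny in-memory dataset. It is only a sketch: the function names and sample movies are made up, and the date conversion assumes UTC and English month abbreviations.

```python
import calendar
from collections import defaultdict
from datetime import datetime


def release_time(release_date):
    """Convert a 'dd-MMM-yyyy' string such as '01-Jan-1995' to Unix seconds (UTC)."""
    parsed = datetime.strptime(release_date, "%d-%b-%Y")
    return calendar.timegm(parsed.timetuple())


def oldest_five_star(ratings, metadata, threshold=4.0):
    """ratings: (user_id, movie_id, rating); metadata: (movie_id, title, release_date)."""
    # GROUP ratings BY movieID
    by_movie = defaultdict(list)
    for _user_id, movie_id, rating in ratings:
        by_movie[movie_id].append(rating)
    # FOREACH ... GENERATE AVG(...), then FILTER BY avgRating > threshold
    averages = {movie: sum(values) / len(values) for movie, values in by_movie.items()}
    top = {movie: avg for movie, avg in averages.items() if avg > threshold}
    # JOIN with the metadata, then ORDER BY release time (oldest first)
    joined = [(title, top[movie], release_time(date))
              for movie, title, date in metadata if movie in top]
    return sorted(joined, key=lambda row: row[2])


# Made-up sample data.
ratings = [(1, 10, 5), (2, 10, 4), (1, 20, 5), (2, 20, 5), (1, 30, 2)]
metadata = [(10, "Movie A", "01-Jan-1995"), (20, "Movie B", "01-Jan-1940"),
            (30, "Movie C", "01-Jan-1990")]
print([title for title, _avg, _time in oldest_five_star(ratings, metadata)])
# → ['Movie B', 'Movie A']
```

Note that, as in the Pig script, "5 star" here really means an average rating above 4.0.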

8.2 – Find the most-rated bad movies ⭐

%%writefile BadPopularMovies.pig

ratings = LOAD 'file:///content/data_pig/ml-100k/u.data'
  AS (userID:int, movieID:int, rating:int, ratingTime:int);
metadata = LOAD 'file:///content/data_pig/ml-100k/u.item' USING PigStorage('|')
    AS (movieID:int, movieTitle:chararray, releaseDate:chararray, videoRelease:chararray, imdblink:chararray);

nameLookup = FOREACH metadata GENERATE movieID, movieTitle;

groupedRating = GROUP ratings by movieID;

avgRatings = FOREACH groupedRating GENERATE group as movieID, AVG(ratings.rating) as avgRating, COUNT(ratings.rating) AS numRatings;  

badMovies = FILTER avgRatings BY avgRating < 2.0;

namedBadMovies = JOIN badMovies BY movieID, nameLookup BY movieID;

results = FOREACH namedBadMovies GENERATE nameLookup::movieTitle as movieName,
          badMovies::avgRating as avgRating, badMovies::numRatings as numRatings;

finalResults = ORDER results BY numRatings DESC;

DUMP finalResults;

Run the script and save the results in a .txt file:

!pig -x local BadPopularMovies.pig > BadPopularMovies.txt
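
If you want to load BadPopularMovies.txt back into Python, keep in mind that movie titles can contain commas and parentheses, so a plain split on "," is not safe. A possible sketch, assuming each line looks like (movieName,avgRating,numRatings), splits from the right instead. The helper name and the sample line are made up.

```python
def parse_bad_movie_line(line):
    """Parse one DUMP line shaped like (movieName,avgRating,numRatings).

    Titles may contain commas and parentheses, so split from the right:
    the last two fields are always the average and the count.
    """
    inner = line.strip()[1:-1]
    title, avg_rating, num_ratings = inner.rsplit(",", 2)
    return title, float(avg_rating), int(num_ratings)


# Made-up line in the expected layout, not real output.
print(parse_bad_movie_line("(Example, The (1996),1.5,12)"))
# → ('Example, The (1996)', 1.5, 12)
```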

9th – Say thanks, like and share if this has been helpful or interesting 😁🖖
