Vipul Pathak

Thursday, May 28, 2015

Sample Pig Statements Demonstrating Many Functions

Filed under: Big Data,Hadoop,Pig,Programming,Technical — Vipul Pathak @ 21:47
Tags: , , , , ,


Pig Statements

-- Load command loads the data
-- Every placeholder like "A_Rel" and "Filter_A" are called Alias, and they are useful
-- in holding the relation returned by pig statements. Aliases are relations (not variables).
A_Rel = LOAD '/hdfs/path/to/file' [AS (col_1[: type], col_2[: type], col_3[: type], ...)] ;

-- Record set returned by a command is called "Relation"
-- Filter command applies a filter condition on the Data Set / Record Set (Relation)
-- A filter can be viewed as a WHERE clause by SQL enthusiasts.
-- Assignment Operator is =    Equality Comparison Operator is ==
Filter_A = FILTER A_Rel BY (col_1 == "New York");

-- The Dump command can dump the content of a Relation on screen. Every row in a
-- is termed as a "Tuple". This will also trigger the underlying MR job.
DUMP Filter_A;

-- Group By operation in Pig Latin is as easy as applying the GROUP command
-- The grouped result will be stored in Alias "Grp_A"
Grp_A = GROUP A_Rel BY col_1;  -- Grp_A is an alias.

-- As it is already known, DUMP will emit the content of Grp_A on screen.
-- The content will contain a Relation that contains the grouping column value
-- followed by a group of unordered tuples (which is called a Bag, similar to a List),
-- followed by another grouping column value ... etc.
-- The first column (e.g. with values like- col_1_val etc) will contain a column called
-- group with value of grouping column, "col_1" in this case. The second column will be
-- the name of grouped relation, (e.g. A_Rel in this case) and that contains all the tuples
-- with same value of col_1 in it.
-- Like:   group       | A_Rel
--         ------------|---------------------------
--         col_1_val,  | { {tuple 1}, {tuple 4} }
--         col_2_val,  | { {tuple 2}, {tuple 3} }
--         ------------|---------------------------
DUMP Grp_A;  -- Until DUMP, you are working on Logical Plan only.

-- Join between 2 relations is a breeze, just say- JOIN and list the relation (R2 or R3) and a
-- BY keyword followed by column name OR (zero based) column index ($0 or $3) for that set.
r2 = LOAD '/path/to/another/data/set' ;
r3 = LOAD 'table_3' USING HCatLoader();  -- If table definition was stored in HCatalog
Joined_C = JOIN r2 BY $0, r3 BY emp_id;

Relations (or Sets or Record Sets) can refer and use fields that are associated with an Alias, e.g. in the FILTER statement we specify an Alias and use its field in an expression … “filtered_sals = FILTER salaries BY income > 70000″. Here salaries is an alias pointing to a relation and income is a field within that relation..


Wednesday, May 27, 2015

Apache Pig at a High Level

Filed under: Technical,Big Data,Hadoop,Pig,Map-Reduce,HDFS — Vipul Pathak @ 11:28
Tags: , , , , , , ,

Pig is a data flow language developed at Yahoo and is a high level language. Pig programs are translated into a lower level instructions supported by underlying execution engine (e.g. Map/Reduce programs). Pig is designed for working on complex operations (e.g. Joins) with speed.

Pig has mainly 2 components- Interpreter and Pig Latin (Scripting Language).
Pig Latin supports the following execution modes-

  1. Script Mode: Script containing Pig Latin statements.
  2. Grunt Shell Mode: This is interactive mode, where we start the Grunt shell and enter Pig Latin statements interactively. 
  3. Embedded Mode: Using the PigServer java class, we can execute a Pig query from within the Java class.

Grunt Modes

  • Local: pig -x local [script_name.pig]
  • MapReduce: pig [-x mapreduce] [hdfs://…/script_name.pig]
  • Tez: pig -x tez [script_name.pig]

Pig statements are executed by Pig Interpreter after validation is done. Instructions in the script/statement are stored in Logical Plan that the Interpreter makes and are executed after converting them into a Physical Plan when a STORE or DUMP or ILLUSTRATE statement is encountered. Pig internally uses MapReduce or Tez execution engine and Pig scripts are eventually converted into MR jobs to do anything they offer.

Multiline Comments  /* are covered by C style comment markers */. Single line comments are marked with  — which is more of a SQL like commenting style.

Design philosophy of the Pig tool, are compared to actual Pig (the animal)

  • Pigs eat anything: Pig (the tool) can process any kind of data (structured or unstructured, with or without Metadata, nested/relational/Key-Value).
  • Pigs can live anywhere: Pig (the tool) initially supported only MapReduce but isn’t designed for only one execution engine. Pig also work with Tez, another execution engine that work on Hadoop platform.
  • Pigs are domestic animals: Like Pig (the animal), Pig (the tool) is adjustable and modifiable by users. Pig support a wide variety of User Define Code, in the form of UDFs, Stream commands, custom partitioners etc.
  • Pigs Fly: Pig scripts are fast to execute. Pig evolves in a way that it remain fast to execute and doesn’t become heavy weight.

Install Pig by downloading it from and untaring it (tar -xvf <tar_file_name> ). Define the Pig installation directory as environment variable PIG_PREFIX and add $PIG_PREFIX/bin to the PATH variable. Also define the HADOOP_PREFIX environment variable. Start Pig’s Grunt shell using the pig command, since HADOOP_PREFIX is defined, Pig will be able to connect to Hadoop Cluster seamlessly.
All HDFS commands work from inside the Grunt shell, e.g we can directly issue the command copyFromLocal or lsr or cd or pwd, without "! hadoop".

Wednesday, May 20, 2015

Default Values Set in Job class in Map/Reduce v2

Filed under: Big Data,Hadoop,Map-Reduce,Technical — Vipul Pathak @ 10:34
Tags: , , , , ,


The following are the values that are set by-default, when we don’t set them explicitly –

Job.setInputFormatClass:    TextInputFormat  (Outputs the starting position of each line as Key (Long), and line content as Value (Text))
Job.setMapperClass:         Mapper           (Default mapper, called IdentityMapper in MRv1, which simply output Key/Value as is)
Job.setMapOutputKeyClass:   LongWritable     (The type of key in intermediate Mapper output)
Job.setMapOutputValueClass: Text             (The type of value in intermediate Mapper output)
Job.setPartitionerClass:    HashPartitioner  (Partitioner class that evaluate each record and decide which Reducer to send this record)
Job.setReducerClass:        Reducer          (Default Reducer, similar to IdentityReducer that output what goes in as an Input)
Job.setNumReducetasks:      1                (By default only one Reducer will process output from all Mappers)
Job.setOutputKeyClass:      LongWritable     (The type of key in the Job’s output created by the Reducer)
Job.setOutputValueClass:    Text             (The type of value in the Job’s output created by the Reducer)
Job.setOutputFormatClass:   TextOutputFormat (Writes each record on a separate line using tab character as separator)

Similarly, following settings are default for Streaming API when we don’t set them explicitly (defaults are in bold rust) –

$ export HADOOP_TOOLS=$HADOOP_HOME/share/hadoop/tools
$ hadoop jar $HADOOP_TOOLS/lib/hadoop-streaming-*.jar
  -input           MyInputDirectory  \
  -output          MyOutputDirectory  \
  -mapper          /user/ubuntu/  \  # This can be any Unix Command or even a Java class
  -inputformat     org.apache.hadoop.mapred.TextInputFormat  \
  -partitioner     org.apache.hadoop.mapred.lib.HashPartitioner  \
  -numReduceTasks  1  \
  -reducer         org.apache.hadoop.mapred.lib.IdentityReducer  \
  -outputformat    org.apache.hadoop.mapred.TextOutputFormat  \
  -io              text  \
  -D     \t  \
  -file            /user/ubuntu/     # This needs to be transported to each Mapper node

Unlike the Java API, the mapper needs to be specified. Also, the streaming API uses the old MapReduce API classes (org.apache.hadoop.mapred). The file packaging options –file OR –archives are only required to be added if we want to run our own script that is not supposed to be present on all cluster nodes. The unix pipes are not supported in the mapper, e.g. “cat dataFile.txt | wc” will not work.


Sunday, May 17, 2015

Apache Sqoop in a Nutshell

Filed under: Hadoop,HDFS,Sqoop,Technical — Vipul Pathak @ 15:46
Tags: , , , ,


Sqoop is a utility that can be used to transfer data between SQL based relational data stores tO/from hadoOP. The main operation this utility carry out is performing a data Import to Hadoop from supported relational data sources and Exporting data back to them. Sqoop uses connectors as extensions to connect to data stores. Currently Sqoop supports connectors to Import/Export data from popular databases like Oracle, MYSQL, Postgres, SQL Server, Teradata, DB2 etc. While SQOOP can do more, but the General use includes the following –

  • Import data from database (sqoop  import  OR  sqoop-import command)
  • Export data to the database  (sqoop  export  OR  sqoop-export command)
  • Create a Hive Table using the table structure of similar table in a database

SQOOP uses Map Reduce in the background to launch Jobs that can read/write data using DBInputFormat class and process data in parallel using multiple Mappers. Both Import and Export actions support -m options where the user can specify number of Mappers that SQOOP should launch to perform parallel operation. During import either a –table option can be used OR –query but not both. Here is the summary of some options below and their expected outcome –

sqoop import
–connect jdbc:mysql://server/db32
–username user_1 –password pass2
–table src_table_name –target /hdfs/dir/name

Imports data from a MYSQL database named db32 using the credentials user_1/pass2. Reads the table src_table_name and places data in an HDFS directory /hdfs/dir/name as a text file.


sqoop import
–connect jdbc:mysql://server/db32
–username user_1 –password pass2
–table src_table_name
–columns "col1,col2,col3,col4"
–append  –target /hdfs/dir/name

Imports data with settings same as above but only read the columns specified in the columns list, then append the data to an already existing HDFS directory and save data in Sequence files.

sqoop import
–connect jdbc:mysql://server/db32
–username user_3 –password pass4
–query "SELECT id,name FROM tbl3 WHERE \$CONDITIONS"
–target /hdfs/dir/name
–split-by UID   -m 6

Imports data from MYSQL but use the query to select data instead of using a table and run 6 Mapper jobs (instead of default 4) to read data in parallel, and use the values of UID column to create 6 queries that can run in parallel. Suppose if UID column has values from 1 to 60,000, the $CONDITIONS will be replaced to create conditions like 1 to 10000, 10001 to 20000, … etc. However, if the table doesn’t have any primary key column, then either use -m 1 (OR) use –split-by clause.


sqoop create-hive-table
–connect jdbc:mysql://server/db32
–username user_3 –password pass4
–table src_table_name
–fields-terminated-by ‘,’

Creates a Hive table with the same name and schema as the source table. The destination data files will be saved with fields separated by a comma and if the destination table already exists, an Error will be thrown.


sqoop import
–connect jdbc:mysql://server/db32
–username user_3 –password pass4
–table src_table_name

Imports data from source database, creates a table in Hive and then load the data into the Hive table. If the Hive table already exists and there is data in it, then an Append will happen.


sqoop export
–connect jdbc:mysql://server/db32
–username user_3 –password pass4
–table db_table_name
–export-dir /read/from/dir

Exports data from HDFS to MYSQL into the db_table_name. The table in database must exists already and so Sqoop will append data into it.


sqoop export
–connect jdbc:mysql://server/db32
–username user_3 –password pass4
–table db_table_name  –update-key UID
–export-dir /read/from/dir
–input-fields-terminated-by ‘,’

Exports data from HDFS to the database but tells Sqoop that the fields in the input data are separated by the comma and if the value of UID is already found in the database table then perform an update instead of inserting the row.


Happy Sqooping  :-)


Thursday, May 14, 2015

Apache Hive From 5000 Feets

Filed under: Big Data,Hadoop,Hive,Technical — Vipul Pathak @ 16:33
Tags: , , , ,
Apache Hive is an abstraction on top of HDFS data, that allow querying the data using the familiar SQL like language, called HiveQL (Hive Query Language). Hive was developed at Facebook to allow data analysts to query data using an SQL like language. Hive has limited commands and is similar to basic SQL (advance SQL options are not supported), but provide nice functionality on top of this limited syntax. Hive is intended to provide data warehouse like functionality on top of Hadoop and helps defining structure for the unstructured big data. It provide features to query analytical data using HiveQL.
Hive is not suitable for OLTP (since it is based on Hadoop and its queries have high latency), but is good for running analytical batch jobs on high volume data that is historical or non-mutable. Typical use case include copying unstructured data on HDFS, Running multiple cycles of Map/Reduce to process data and store (semi-)structured data file on HDFS and then using hive to define/apply the structure. Once structure is defined, data analyst’ point to these data files (or load data from these data files) and run various type of queries to analyze data and find hidden statistical patterns out of the data  :-)
Hive closely resemble with SQL and at a high level, support the following type of activities –
  1. Creation and Removal of Databases
  2. Creation, Alteration and Removal of Managed/External Tables within (or outside of) these databases
  3. Insertion/Loading of data inside these Tables
  4. Selection of data using HiveQL, including the capability of performing Join operations
  5. Allows the use of aggregation functions in the Select queries, that most of the times, triggers Map/Reduce jobs in the background.
  6. Structured output on query, (e.g. Schema on Read) not while loading data but when retrieving data.

Friday, July 26, 2013

Difference Between Software Architecture And Software Design

I had this question in my mind for years. Based on some experience, some good books (listed in good reads) and my constant interaction with some architecture geniuses at work, I now have clarity on this :-)

Generally speaking, architecture is a branch of design. So every architectural work is a kind of design, but the reverse is not true. Not all design is architecture. Architectural designs addresses a specific goal. The goal almost always is to assure that the system meet its quality attributes and behavioral requirements. The architect creates design that manifest significant decisions and constraints in the system’s architectural structure. There are also design decisions, that are not architecturally significant, i.e. does not belongs to the architecture branch of design. We can view them as some component’s internal design decisions, like- choice of algorithm, selection of data structure etc. Any design decision, which isn’t visible outside of its component boundary is a component’s internal design and is non-architectural. These are the design decisions a system architect would leave on module designer’s discretion or the implementation team. The implementation team is free to design their subsystem as long as their design don’t break the architectural constraints imposed by the system level architecture.

Architecture can be detailed or descriptive, such as defining a protocol that will be used by outside world entities to interact with the system (e.g. A hypermedia RESTful interface for a service of a large system), OR it may be some constraints defined at high level to support quality attributes and to be followed by the system’s internal implementation (e.g. The usage information of all the 6 resources should be collected every 15 seconds and stored in-memory in binary format with memory usage of no more than 16 bits per sample; The usage info buffer should be uploaded to server every 8 hours or whenever the coverage is available after 8 hours).

Architectural design deals with the quality attributes (performance, security, changeability, reliability, user experience etc.) and behavioral requirements (e.g. Usage information of devices should be upload to server with least possible network footprint) of the system/element. Anything that needs design but is not in the above two categories, IS element’s internal design AND NOT system’s architectural design. However, module designers may have to introduce structure or may have to take design decisions to fulfill the behavioral requirements of their module. In that case, those design decisions are architectural to their module, but not to entire system. Hence, design decisions to be called as architectural or not, is a function of the context.

Saturday, July 20, 2013

What is Software Architecture – Some Thoughts

Filed under: Architecture,Technical — Vipul Pathak @ 03:49
Tags: , , , ,


Casually speaking, Software architecture is the careful partitioning or subdivision of a system as whole, into sub-parts with specific relationship between these sub-parts. This partitioning is what allows different people (or group(s) of people) to work together cooperatively to solve a much bigger problem, then any of them could solve by individually dealing with it. Each person (or team of persons) creates a software part (you may want to call it a component, module or a sub-system) that interacts with other person’s (or team’s) software part. This interaction happens through carefully crafted interfaces that exposes minimum but most stable information necessary for interaction.

The larger and complex the system is, the more critical is this subdivision. A system is almost always, unavoidably can be divided into parts in more than one way. Each way of division, results in creation of architectural structure. Each way of dividing the system into parts, relationships between them and the way they interact (interface), is a result of careful design to satisfy the driving quality attribute (security, performance, changeability etc.) and business goal behind them. So a system may be sub-divided into parts in different ways, if the driving quality attribute or business goal is different.

Finally, a formal definition of software architecture from IEEE –

“The fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution.”

Why architecture is important?

Software architecture is a set of design decisions which are hard to revert later and if made incorrectly, may cause the project to cancel or be unsuccessfully. Since these architectural design decisions are the ones that permits a system to meet its quality attributes and behavioral requirements, if these decisions go wrong, the system cannot achieve the desired quality or behavior. The quality attributes are usually parts of requirements (non-functional requirements) and failing to fulfill these requirements automatically invalidate the solution.

Sunday, May 30, 2010

Friday, May 14, 2010

Status Returned By BlackBerry API During "Data Connection Refused" Conditions

The following reading were recorded on 13 BlackBerry mobile phone. Some of them (columns marked with *) were Blackberry phone with correct Blackberry data plan, while some of them were simple non blackberry phones (like- Nokia 6600, explicitly shown) with GPRS connectivity. One of them were even lacking the GPRS connectivity. I studied the data connectivity with an intention to record the behaviour of RIM’s BlackBerry API in low or no data connectivity conditions.

The BlackBerry devices used of the study includes- BlackBerry 9000 (Bold), BlackBerry 8320 (Curve), BlackBerry 8110 (Pearl), BlackBerry 8800, BlackBerry 8300 (Curve) etc. The BlackBerry JDE used was 4.3.

Row number 1 and 2 (Carrier? And MDS?) represents values returned by CoverageInfo.getCoverageStatus().

All “True +ve” shown above are real Data Connection Refused cases. Below is a cut down version of the above table with only the required data. Two of “True +ve” are matching with each other (Column D and E below). However, Column C (being a True +ve) is matching with all the other “False +ve”, e.g. Column B.

Due to identification of Column B as Out of Coverage (though the indicator on home screen shows capital EDGE), the conclusion based on various Blackberry APIs is confusing.

Saturday, May 08, 2010

Setting READ/WRITE Permissions for the ISAPI Extension Hosted on IIS

The script (or the ISAPI Extension) that we are going to deploy and grant additional permissions, will be writing to a File and hence need WRITE permissions as well. Here is how we will do it:

Find out the User Id

IIS’s World Wide Web Publishing Service executes using the Local System account, but it usually impersonates a different account to execute any ISAPI extension. Do the following to discover which User Id is being used by IIS for impersonating and running the extension:

1. Start Internet Information Services Manager from Administrative Tools under Control Panel.

2. Expand the tree on the left side and select the Website under which the ISAPI extension will be deployed.

3. Next, on the right side, under the IIS group, double click on the Authentication icon (see below):

4. In the resulting listing, under the Features View tab, choose Anonymous Authentication and click on the Edit action.

5. The identity selected in the resulting dialog box is used by IIS for impersonating (see below):

If the identity select isn’t shown clearly, but instead the Application Pool Identity is shown as selected, here is how to find which account is used by the Application Pool:

6. From the left pane, select the Website and click on Basic Settings link from the Actions area at the right. The resulting dialog will tell us, which Application Pool is being used by this Website (see below):

7. Click Cancel and dismiss the dialog. Next, from the left pane, select the Application Pools node just above the Sites node. The right pane will show available application pools. Select the Application Pool, that’s being used by our Website.

8. Now click on the Advanced Settings action under the Actions area on the right side (see below):

9. The Identity property shows the account used by the application pool.

Setting the Directory Permission on NTFS

Now that we know, which account is used by IIS for the purpose of impersonation, let’s set WRITE permissions on the correct directory for the User Id used by IIS. Let’s say- we have an ISAPI Extension or a script that creates or updates a file on the local file system. Based on what we know, we need to grant WRITE permission on a folder where our ISAPI Extension will write. Now since, we know the User Id that will be used by IIS for impersonation, let grant write permission to that user, (see below):

Finally, restart IIS. From the command line, use the command IISReset. We are done J.

Installing an ISAPI Extension as a Wildcard Handler

Filed under: Technical,Web — Vipul Pathak @ 13:47
Tags: , , ,

Starting from IIS 6.0, an ISAPI Extension can be added to any Web site or Application, as a wildcard script. By wild card script, I mean an ISAPI extension can be registered with IIS to execute on every incoming request on Web site/Application. All we need to do it register the extension as a wildcard handler. A wildcard handler receives the incoming request, before the actually intended recipient. There may be multiple wildcard handlers registered on the same site or application, which will all be executed in order. The ISAPI Extension, when invoked by IIS, can process the request and pass the request to another handler in chain. Below is how we will register the wildcard script on an application (or website):

IIS 6.0

  1. Start IIS Manager and expand the tree on the left side.
  2. Select any Application/ Website on which you want to install the wildcard handler. Right click and select Properties.
    Application Properties
    Application Properties
  3. In the Application Settings section on Home Directory or Virtual Directory tab, click on the Configuration.
  4. In the Wildcard Application Maps section on the Mappings tab, click the Insert button.
    Specifying a wildcard handler DLL
    Specifying a wildcard handler DLL
  5. Type or Browse the path of our ISAPI Extension DLL and click OK 3 times to go back at work.

IIS 7.0

  1. Start IIS Manager and expand the tree on the left side.
  2. Select any Application/ Website on which you want to install the wildcard handler.
  3. On the right pane, within the features view, double click on the Handler Mappings icon under IIS section (as shown below).
    Mappings Handler Settings in IIS 7.0
    The Handler Mappings Shortcut in IIS 7.0
  4. Once the page containings all the existing mappings opens, locate the link Add Wildcard Script Map… at the right side and click it (as shown below).
    Link to Add a Wildcard Handler Script
    Link to Add a Wildcard Handler Script
  5. In the resulting dialog, specify the path of our ISAPI extension DLL and give it a name (Below):
    Specify the ISAPI Extension DLL
    Specify the ISAPI Extension DLL
  6. We are done  :-)

Thursday, August 20, 2009

Hello World!

Filed under: Uncategorized — Vipul Pathak @ 13:26

Greetings !!

Wikipedia is great source of free knowledge for nearly all of us. Lets start with supporting Wikipedia :
Wikipedia Affiliate Button

There is also a great source of software engineering audio podcasts available, that you can find at: If you like the podcast, you may want to support SE-Radio too:
Support Software Engineering Radio

Will meet you soon here, with my first article.


The Rubric Theme. Create a free website or blog at


Get every new post delivered to your Inbox.