Hive UDF test

get dual table by

CREATE TABLE dual (dummy STRING); 
INSERT INTO select 'x' from another_table limit 1;
select regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 1) from dual;

select regexp_extract(‘foothebar’, ‘foo(.*?)(bar)’, 1) from dual;

insert into and insert overwrite


  • INSERT INTO to append data to a table.
  • INSERT OVERWRITE to replace the data in a table; each new set of inserted rows replaces any existing data in the table.


# from a table with the same definition 
insert into table text_table select * from default.tab1;

# The VALUES clause is a general-purpose way to specify all the columns of a row or multiple rows
insert into val_example values (1,true,100.0);


union all

The SQL UNION ALL operator is used to combine the result sets of 2 or more SELECT statements. It returns all rows from the query (even if the row exists in more than one of the SELECT statements).


HDFS=hadoop distributed file system

Hadoop = HDFS + MapReduce

Hive = provide a SQL layer over HDFS or other file system


Apache Giraph is an Apache project to perform graph processing on big data. Giraph utilizes Apache Hadoop‘s MapReduce implementation to process graphs. Facebook used Giraph with some performance improvements to analyze one trillion edges using 200 machines in 4 minute

many to many relationship design

there are many videos, each videos has many user’s comments;

videos(id, title)

user(id, login)

joint table comments(id, video_id, user_id, comment_string)

why? one video can have many comments, one user can have many comments.

better than (video_id, comment_id) and (comment_id, user_id)

why? there is no same (comment+user) for different videos. so there is no need to have two joint tables.