I mentioned in an earlier article that Pig can load just some of the columns of a file: if the file has two columns and you declare only one, that works fine.

However, suppose there are two files, f1 and f2, where f1 has 42 columns and f2 has 43. What happens if both are loaded into the same relation at once?

Answer: the load succeeds, but the relation has no structure (schema unknown). Running DESCRIBE afterwards prints: Schema for origin_cleaned_data unknown.

The situation is similar to union: merging two relations with different columns produces a relation whose schema is unknown.

Background: the old log has 42 columns, and the new log inserts one extra column at position 20, so the columns from position 20 onward can no longer be addressed by the same names in both formats. The goal is to count the total number of user clicks in the logs, so I load both formats together and compute the statistics in one pass.

(If you know which log format each date uses, you can load each set with an explicit schema, combine them with UNION ONSCHEMA, and then compute the statistics. Unfortunately I inherited this project, and I am not sure when the format change went live.)

Samples: the old log is log_without.txt, the new log is log_with_android_ad_id.txt.

The code is as follows

REGISTER piggybank.jar;

DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();

%default cleanedLog /user/wizad/tmp/log_*

--%default cleanedLog1 /home/wizad/lmj/log_without.txt

--%default cleanedLog2 /home/wizad/lmj/log_with_android_ad_id.txt

origin_cleaned_data = LOAD '$cleanedLog' USING PigStorage(','); 

DUMP origin_cleaned_data;

DESCRIBE origin_cleaned_data;

Output:

((null) 5,74,48809e40-b8d7-41a4-bf68-d0f8e28140ad,575356365101899146,2014-07-30 10:33:56,2014-07-30 10:33:56,1,57074,2,,,,,,,,,1,-1,-1,lmj,-1,1ac2c73e-d93a-4801-a7ee-da05473d0585,48809e40-b8d7-41a4-bf68-d0f8e28140ad,02:00:00:00:00:00,1940064625594046032,,,,d70cc494,25100,206,,0,2,2,7.1,,,,42.833298,12.833298,120232210032202)

((null) 5,74,357633052513139,1033882907630785616,2014-07-30 11:15:05,2014-07-30 11:15:05,1,57074,2,,,,,,,,,1,357633052513139,270f213575a4eda7,lmj,270f213575a4eda7,,,40:0e:85:40:0e:1a,-7537294162085162169,,,,7626e397,62713,206,,2,1,3,4.3,,,,37.774902,-122.4194,023010203333003)

((null) 5,74,e7a4afce-ffd9-4ecd-b916-39f9d793c218,207640323432175503,2014-07-30 10:29:22,2014-07-30 10:29:22,1,57074,2,,,,,,,,,1,-1,-1,lmj,-1,14ea5e95237f34e278d7ac210173d6b8ad9d5026,e7a4afce-ffd9-4ecd-b916-39f9d793c218,02:00:00:00:00:00,1179719885610920154,,,,d4eeab6e,66104,101,,0,2,2,7.1,1,7,7,39.928894,116.388306,132100103322203)

((null) 5,74,48809e40-b8d7-41a4-bf68-d0f8e28140ad,575356365101899146,2014-07-30 10:33:56,2014-07-30 10:33:56,1,57074,2,,,,,,,,,1,-1,-1,-1,1ac2c73e-d93a-4801-a7ee-da05473d0585,48809e40-b8d7-41a4-bf68-d0f8e28140ad,02:00:00:00:00:00,1940064625594046032,,,,d70cc494,25100,206,,0,2,2,7.1,,,,42.833298,12.833298,120232210032202)

((null) 5,74,302bd8f1-b974-4af5-8183-1f67d27410d6,367366268601246781,2014-07-30 10:07:57,2014-07-30 10:07:57,1,57074,2,,,,,,,,,1,-1,-1,-1,c165376f9f76cf68862a505328b7ba7cd0cfa0b0,302bd8f1-b974-4af5-8183-1f67d27410d6,02:00:00:00:00:00,-488564527359896578,,,,103b14d3,25100,206,,0,2,2,7.1,,,,37.774902,-122.4194,023010203333003)

((null) 5,74,e7a4afce-ffd9-4ecd-b916-39f9d793c218,207640323432175503,2014-07-30 10:29:22,2014-07-30 10:29:22,1,57074,2,,,,,,,,,1,-1,-1,-1,14ea5e95237f34e278d7ac210173d6b8ad9d5026,e7a4afce-ffd9-4ecd-b916-39f9d793c218,02:00:00:00:00:00,1179719885610920154,,,,d4eeab6e,66104,101,,0,2,2,7.1,1,7,7,39.928894,116.388306,132100103322203)

Schema for origin_cleaned_data unknown.

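The rows with one extra column are the ones containing the value lmj. You can see that the relation has no schema: the LOAD statement has no AS clause, so Pig has no field names or types to report.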
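Even without a schema, the combined relation can still be used for simple statistics through positional references ($0, $1, ...). The following is only a minimal sketch, assuming every row counts as one record; which column marks a click is not stated here, so the aliases c0 and c1, and the choice of COUNT_STAR over a filtered count, are illustrative assumptions.

```pig
-- Same load as above: no AS clause, so the schema stays unknown.
origin_cleaned_data = LOAD '/user/wizad/tmp/log_*' USING PigStorage(',');

-- Fields before the inserted column sit at the same position in both
-- formats, so they can be addressed by position and cast explicitly.
-- (c0 and c1 are hypothetical names chosen for this sketch.)
first_cols = FOREACH origin_cleaned_data
             GENERATE (chararray)$0 AS c0, (chararray)$1 AS c1;

-- Count all rows across both log formats in one pass.
all_rows = GROUP origin_cleaned_data ALL;
total    = FOREACH all_rows GENERATE COUNT_STAR(origin_cleaned_data) AS n;

DUMP total;
```

Positions after the inserted column differ by one between the old and new format, so positional references beyond that point would mix different fields.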

union: merging relations with different column formats

(UNION does not remove duplicates.)

A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:chararray);
C = union A, B;
describe C;
Output:
Schema for C unknown.

Unioning two relations with different columns: use ONSCHEMA

Note: with ONSCHEMA, every input must have an explicit schema; otherwise Pig reports an error. This is because the union matches fields by name and reconciles their types (a lower type is automatically promoted to a higher one, for example float to double).

After the merge, a field that is missing from one input is filled with null for that input's rows.

A = load 'input1' as (w: chararray, x:int, y:float);
B = load 'input2' as (x:int, y:double, z:chararray);
C = union onschema A, B;
describe C;
Result:
C: {w: chararray,x: int,y: double,z: chararray}
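Applied to the two log files from this article, the same idea might look like the sketch below. It is a shortened illustration, not the real layout: the actual logs have 42 and 43 columns, and the field names (log_type, app_id, android_ad_id, device_id) are hypothetical placeholders, so the schemas would need to be replaced with the real column lists.

```pig
-- Hypothetical, shortened schemas; the real logs have 42 and 43 columns.
old_log = LOAD '/home/wizad/lmj/log_without.txt' USING PigStorage(',')
          AS (log_type:chararray, app_id:chararray, device_id:chararray);

-- The new format carries one extra field (named android_ad_id here
-- purely for illustration).
new_log = LOAD '/home/wizad/lmj/log_with_android_ad_id.txt' USING PigStorage(',')
          AS (log_type:chararray, app_id:chararray,
              android_ad_id:chararray, device_id:chararray);

-- Fields are matched by name; rows from old_log get null for android_ad_id.
all_logs = UNION ONSCHEMA old_log, new_log;

DESCRIBE all_logs;
```

Because both inputs declare a schema, the merged relation has named fields, and later statements can refer to them by name regardless of which file a row came from.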

A code example where the union cannot produce a schema:

%default cleanedLog1 /home/wizad/lmj/log_without.txt

%default cleanedLog2 /home/wizad/lmj/log_with_android_ad_id.txt

origin1 = LOAD '$cleanedLog1' USING PigStorage(','); 

origin2 = LOAD '$cleanedLog2' USING PigStorage(',');

DESCRIBE origin1;

DESCRIBE origin2;

origin = UNION origin1, origin2;

Result:

DESCRIBE reports an unknown schema for both origin1 and origin2 (for example, Schema for origin2 unknown.).

So a schema for origin cannot be generated either.

