Usually textscan and regexp are the way to go when parsing string fields (as shown here):
Read the input lines as strings with textscan:

    fid = fopen('input.px', 'r');
    C = textscan(fid, '%s', 'Delimiter', '\n');
    fclose(fid);
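To see what textscan actually returns, here is a minimal sketch that writes a tiny temporary file and reads it back. The file contents and the variable names fname and lines are made up for illustration; they are not taken from your input file.

```matlab
% Write a small temporary file (contents are invented for this illustration)
fname = [tempname, '.px'];
fid = fopen(fname, 'w');
fprintf(fid, 'CHARSET="ANSI";\nMATRIX="BE001";\n');
fclose(fid);

% Read it back, one string per line
fid = fopen(fname, 'r');
C = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
delete(fname);

% C is a 1x1 cell; C{1} (equivalently C{:}) is an N-by-1 cell array of strings
lines = C{1};
```

This is why the later steps index with C{:}: the lines themselves sit one level down inside C.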
Parse the header field names and values using regexp. Picking the right regular expression should do the trick!

    X = regexp(C{:}, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
    X = [X{:}];                    % Flatten the cell array
    X = reshape([X{:}], 2, []);    % Reshape into name-value pairs
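As a quick check of what that pattern matches, here is a small sketch on a few hand-written lines. The sample lines (and the names lines, tok and pairs) are assumptions for illustration only. Note that the VALUES line is skipped on purpose, because the name part of the pattern refuses parentheses.

```matlab
% Hypothetical sample lines standing in for C{:}
lines = {'CHARSET="ANSI";'; 'SUBJECT-CODE="BE";'; 'VALUES("region")="A","B";'};

% Same pattern as above: a name without '=' or parentheses, then a quoted value
tok = regexp(lines, '^\s*([^=\(\)]+)\s*=\s*"([^"]+)"\s*', 'tokens');
tok = [tok{:}];                      % lines without a match simply drop out
pairs = reshape([tok{:}], 2, []);    % row 1: names, row 2: values
```

Here pairs ends up as a 2-by-2 cell array with the names CHARSET and SUBJECT-CODE in the first row and their values in the second.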
The "VALUES" fields may span multiple lines, so they need to be concatenated first:

    idx_data = find(~cellfun('isempty', regexp(C{:}, '^\s*Data')), 1);
    idx_values = find(~cellfun('isempty', regexp(C{:}, '^\s*VALUES')));
    Y = arrayfun(@(m, n){[C{:}{m:m + n - 1}]}, ...
        idx_values(idx_values < idx_data), diff([idx_values; idx_data]));
... and then tokenized:

    Y = regexp(Y, '"([^,"]+)"', 'tokens');            % Tokenize values
    Y = cellfun(@(x){{x{1}{1}, {[x{2:end}]}}}, Y);    % Group values in one array
    Y = reshape([Y{:}], 2, []);                       % Reshape into name-value pairs
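If the arrayfun/cellfun one-liners are hard to follow, the same concatenate-then-tokenize idea can be sketched as a plain loop. This is only an illustration on invented lines, not a drop-in replacement; the names lines, bounds and fields are hypothetical.

```matlab
% Hypothetical lines: one VALUES field split over two lines, then the data marker
lines = {'VALUES("region")="North","South",'; '"East";'; 'Data='};

idx_data   = find(~cellfun('isempty', regexp(lines, '^\s*Data')), 1);
idx_values = find(~cellfun('isempty', regexp(lines, '^\s*VALUES')));
idx_values = idx_values(idx_values < idx_data);   % ignore anything after the data marker

% Each field runs from its VALUES line up to the next VALUES line (or the data marker)
bounds = [idx_values; idx_data];
fields = cell(2, numel(idx_values));
for k = 1:numel(idx_values)
    joined = [lines{bounds(k):bounds(k + 1) - 1}];   % glue the pieces together
    tok = regexp(joined, '"([^,"]+)"', 'tokens');    % every quoted string
    tok = [tok{:}];                                  % flatten to a cell of strings
    fields{1, k} = tok{1};                           % first token: field name
    fields{2, k} = tok(2:end);                       % remaining tokens: values
end
```

For these sample lines, fields{1, 1} is 'region' and fields{2, 1} is a 1x3 cell holding North, South and East.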
Make sure the field names are legal (I've decided to convert everything to lowercase and replace hyphens and any whitespace with underscores), and plug them into a struct:

    X = [X, Y];                                           % Store all fields in one array
    X(1, :) = lower(regexprep(X(1, :), '-+|\s+', '_'));   % Make field names legal
    S = struct(X{:});
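One detail worth spelling out: struct treats a cell array argument as a request for a struct array, so cell-valued fields have to be wrapped in one extra cell (which is what the grouping step does for the "VALUES" fields). A minimal sketch with made-up pairs, where the variable name pairs is hypothetical:

```matlab
% Hypothetical name-value pairs; the cell-valued field is wrapped in an extra
% cell so that struct() stores it as one value instead of building a struct array
pairs = {'SUBJECT-CODE', 'Subject area', 'region'; ...
         'BE',           'Population',   {{'North', 'South'}}};

% Lowercase the names and replace hyphens and whitespace with underscores
pairs(1, :) = lower(regexprep(pairs(1, :), '-+|\s+', '_'));
assert(all(cellfun(@isvarname, pairs(1, :))));   % every name is now a legal identifier

S = struct(pairs{:});   % 1x1 struct with fields subject_code, subject_area, region
```

Without the extra wrapping cell, S would come out as a 1x2 struct array with one region per element.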
Here's what I get for your input file (only the header fields):
    S =
            charset: 'ANSI'
             matrix: 'BE001'
       subject_code: 'BE'
       subject_area: 'Population'
              title: 'Population by region, time, marital status and sex.'
              month: {1x12 cell}
             region: {1x5 cell}
As for the data itself, it needs to be handled separately:
Extract the data lines after the "Data" field and replace all ".." values with a default value (say, NaN):

    D = strrep(C{:}(idx_data + 1:end), '".."', 'NaN');
This assumes that only numerical data follows the "Data" field. However, it can easily be modified if that is not the case.
Convert the data to a numerical matrix and add it to the structure:

    D = cellfun(@str2num, D, 'UniformOutput', false);
    S.data = vertcat(D{:})
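Here is the same replace-and-convert step on two invented data lines, just to show the shapes involved. The sample lines and the names raw and data are assumptions. Keep in mind that vertcat only works if every line yields the same number of values.

```matlab
% Hypothetical data lines as they might appear after the "Data" field
raw = {'".." 24.8 34.2'; '41.2 43.0 50.8'};

D = strrep(raw, '".."', 'NaN');                      % missing values become NaN
D = cellfun(@str2num, D, 'UniformOutput', false);    % one numeric row vector per line
data = vertcat(D{:});                                % stack rows (equal lengths required)
```

For this sample, data is a 2-by-3 matrix whose first element is NaN.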
And here's S.data for your input file:

    S.data =
            NaN        NaN        NaN        NaN        NaN
            NaN        NaN        NaN        NaN        NaN
            NaN   24.80000   34.20000   52.00000   23.00000
            NaN   32.10000   40.30000   50.70000    1.00000
            NaN   31.60000   35.00000   49.10000    2.30000
       41.20000   43.00000   50.80000   60.10000    0.00000
       50.90000   52.00000   53.90000   65.90000    0.00000
Hope this helps!