Subject: Re: [erlang-questions] List Question



On 2017年08月07日 月曜日 23:46:52 you wrote:
> Hello Craig,
>
> Thanks for your help.
>
> I am trying to store the data as efficiently as possible. Its HL7
> natively and this is my test:
>
> OBX|17|FT~TEST|8265-1^^LN&SUBCOMP|1&2&3&4|\H\Spot Image 2\N\||||||F
>
> |~^& are delimiters. The hierarchy is only so deep and using lists of
> lists to provide a tree like way to access the data eg Field 3, repeat
> 1 component 2 subcomponent1
>
> Parsed it looks like this:
>
> [["OBX","17",
> ["FT","TEST"],
> [["8265-1",[],["LN","SUBCOMP"]]],
> [[["1","2","3","4"]]],
> "\\H\\Spot Image 2\\N\\",[],[],[],[],[],"F"]]
>
> As the format evolves over time the hierarchy can be extended, but
> older clients can still read the value they are expecting if they
> follow the rules, like reading the first value in the list when you
> only expect one value to be there.
>
> Currently a typical system might have 12 million of these records so
> want to keep format as small as possible in the erlang format, hence
> reluctant to tag 2 much, but know how to get value of interest. Maybe
> that is my non erlang background showing up? Traversing 4 small lists
> by index should be fast??
>
> I guess I could save strings as binary in the lists then is_binary
> should work?? Is that the case. I gather on 64bit system especially
> binary is more space efficient.

We would really want to know a bit about the data's natural semantics before
leaping to conclusions regarding "most efficient" or especially "best" internal
representation.

On the note of "internal representation"... do you need 12 million records in
memory at a given time, or are these things you can store as strings and then
pull the ones you want as you go? Do you need random access to records? To be
able to query them? Are you performing aggregate operations over them? All of
these might change the answer of "what is best".

For example, if you need a few tens of millions of records in memory at a given
time AND the short strings like "FT" and "TEST" are always members of a
smallish, finite set you know beforehand then atoms will be much more space
efficient if that is your main concern. Atoms are tiny in memory (essentially
like using integer constants in a C program), but they serialize back to
strings in storage so that can change things. For example, the atom 'test' (or
'TEST') is smaller in memory than either the string "TEST" or the binary
<<"TEST">>.

In any case, just to get the code right first so you have a platform to change
things around I would recommend TOTALLY FORGETTING ABOUT THIS to start with and
jump straight to tagged pairs. This will keep your semantics straight and
dramatically reduce the mental overhead of writing traversal functions. You can
switch that stuff around later -- but first you need things to work at least
partially before you can start measuring stuff.

For example:

> [["OBX","17",
> ["FT","TEST"],
> [["8265-1",[],["LN","SUBCOMP"]]],
> [[["1","2","3","4"]]],
> "\\H\\Spot Image 2\\N\\",[],[],[],[],[],"F"]]

I would parse that to something like this:

[{foo, "OBX"},
{bar, 17},
{baz, {"FT", "TEST"}},
[[{biz, "8265-1"}
{boz, []},
{ballz, {"LN", "SUBCOMP"}}]],
[[{bonz, [1, 2, 3, 4]}]],
{coz, "\\H\\Spot Image 2\\N\\"},
{fuzz, "F"}]

The leading tag should always disambiguate the MEANING of the element you are
encountering in your lists. Here I used junk terms, but I think you get the
idea. Whatever atom labels you use should be unambiguous and mean something to
the humans reading the code (or crash dumps).

The overhead here VS the list version is the atoms and enclosing tuple (a few
bytes for both) -- which is negligible compared to the cost of using actual
strings directly in your code. Remember, an atom is basically an alias for a
constant, so don't fret over it too much -- just don't start doing insane
things like dynamically generating an arbitrary number of unique atoms!

But again, remember, we don't care how big this is in memory yet. What we care
about is that it is easy to traverse and know what you mean when you do so, and
also that it is easy to unambiguously cast the data from an old format to a new
one (which is inherently easier with tagged data than with semantically
amibugous lists of lists).

Obviously giving further advice would be a bit easier if I knew more about the
meaning of the data itself, but anyway, the primary concern here is getting the
logic right first. You can crunch things down later -- and most of the time
that usually has much less to do with finding a "small" internal representation
for those moments while you are actively handling some bits of data in memory
and a LOT more to do with how you are going to index serialized data on disk
(ETS/DETS/Mnesia may be a good route for that, or Postgres, or maps, whatever
-- depending on what you're up to).

Hopefully I explained more than confused.

-Craig
_______________________________________________
erlang-questions mailing list
[email protected]
http://erlang.org/mailman/listinfo/erlang-questions



Programming list archiving by: Enterprise Git Hosting