Exceptions and Rejection with flex tables

Exceptions and rejected-data files are not currently supported while loading flex tables.
Are you guys going to do something about it?
When I use the COPY command on a flex table to load from a JSON file using fjsonparser, it just rejects all data in the file if the file contains any invalid JSON.
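
For example (a sketch; the file path and contents are made up), a file where a single record is malformed fails the whole load:

    -- /data/bad.json contains three records, the second one malformed:
    --   {"id": 1}{"id": 2,}{"id": 3}
    copy flextable from '/data/bad.json' parser fjsonparser();
    -- the load fails at the malformed record instead of skipping it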

Thanks,
naveen

Comments

  • On encountering an error, fjsonparser continues the import, but no more rows are actually loaded - this is very annoying...
  • It kind of makes sense that your data is rejected, as you need valid JSON in order to use it; otherwise it's interpreted as a txt or a csv file - anything but JSON.
    You should validate your JSON file before loading it. Use the online tool at http://jsonlint.com/
  • My JSON is valid; it's generated directly from Python's json.dumps(). The data is around 1 GB, and it's not so easy to validate 1 GB of data using an online validator.
  • Hi Libor,

    We're aware that rejection for JSON would be a useful feature - thanks for pointing it out again. It's a little tricky (though not fundamentally hard): Vertica can parse JSON documents which are much larger than system memory, so it must use a streaming parser, and typical existing streaming parsers don't provide a mechanism to rewind the stream to capture the entirety of the record that is being rejected.

    That said, my understanding is that we do try to provide basic support for bad records -- we won't tell you in depth what's wrong, but we will try to recover and continue parsing with the next record.

    It sounds like that's not happening for you.  Usually this indicates some deeper sort of corruption; maybe as simple as an extra close bracket somewhere, maybe much more complex.

    Can you narrow down what in the file is causing this issue?

    Thanks,
    Adam
  • Minimal rejection support does exist for Flexible Tables and is expected to increase with time. In Crane, this is only the "reject_on_duplicate" parser argument.

    To the issue here: without record terminators in the JSON format, an error in the JSON itself makes it ambiguous where the separations between records lie, and therefore we currently fail the load. Look in vertica.log to see the characters which caused the syntax error and the failed load. We'd be interested in hearing your desired behavior, given the record-terminator ambiguity.

    Regarding general JSON validity, if it passes a syntax check such as JSONLint.com, it should load without problem. We'd be very curious and grateful to see an example which passes with JSONLint but fails with the FJSONParser.
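
    As a concrete sketch (the table and file names here are hypothetical), this is the kind of record reject_on_duplicate catches:

        -- /tmp/dup.json contains a record that repeats the key "a":
        --   {"a": 1, "a": 2}
        copy flextable from '/tmp/dup.json' parser fjsonparser(reject_on_duplicate='true');
        -- with the parameter set, a record with a duplicated key is rejected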
  • Hi,
    Can you post an example of how to use this parameter with fjsonparser?
    I tried to copy the same data twice with this parameter, but it didn't work:
    copy flextable from '/newdel.json' parser fjsonparser(reject_on_duplicate='true');
     Rows Loaded
    -------------
               1

    copy flextable from '/newdel.json' parser fjsonparser(reject_on_duplicate='true');

     Rows Loaded
    -------------
               1

    Am I missing something?

    Thanks,
    Naveen
  • I believe it's for rejecting a row that has two keys with the same name, not for rejecting two rows with the same values in the same table.
  • Ohh Yea, I got it.
    Thanks. 

    But there should be some kind of parameter/configuration for rejecting duplicate tuples while loading a flex table.
    Having to copy DISTINCT rows from one table to another to get unique tuples is totally annoying (see the sketch below).
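
    A rough sketch of that workaround (the target table name and the JSON keys "id" and "name" are hypothetical; flex virtual columns are cast to concrete types):

        -- keep one row per distinct (id, name) pair from the flex table
        create table flextable_unique as
            select distinct id::int as id, name::varchar as name
            from flextable;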

  • Thanks for the prompt response. The only thing I found in vertica.log is
    LOCATION:  reject, /scratch_a/release/vbuild/vertica/EE/ImpExp/RejectRows.cpp:42
    To be honest, I didn't expect that there is no record terminator; I'm using newlines. The desired behaviour for me would be each record on a single line, or allowing the record separator to be configured in the copy query (it could retain the current behaviour if not set).

    I'm not able to find the offending record. When I try to extract a block of 100 records (lines, in my case) around the failing one, all of them import without problems.
    Using the __identity__ field, I checked the last imported record: it is indeed the line just before the failing record, and it is imported correctly. I'm puzzled.

    Wondering whether it might be connected to the input data size, I tried this (record 90764 fails):

    $ <neighborhood.json.gz zcat | head -90763 | wc
       90763  3502779 1362091697

    The record sizes are small:

    $ <neighborhood.json.gz zcat | tail -n +90764 | head -1 | wc
           1      41    8610
    $ <neighborhood.json.gz zcat | tail -n +90763 | head -1 | wc
           1      15     230

    Thanks!


  • Greetings! A few things to respond to here:

    When looking for the error in vertica.log, look for lines which contain "<UDx>". If your issue is one of record size, you will get one or more entries complaining about needing n bytes, but only having m available.

    Regarding record terminators: you're welcome to split your records onto separate lines for easier management, but newlines don't have any meaning within JSON; that is, they will be ignored on parse.

    If record #90764 is failing, what does JSONLint.com have to say about its syntax?

  • I'm not quite clear on what you mean by "tuple" in this case. This one parameter is the only one for handling rejections in the initial (Crane) Flex release, unfortunately.

    If a row contains duplicate tuples, then their key will be the same (or they're not duplicates), so this parameter will work in this case.

    If you're asking Vertica to refuse to load rows which already exist in the destination table, we unfortunately don't have plans to support this at this time, in part because this will be a Very Large operation requiring the entirety of the table to be read in for every load.
  • There are no lines with UDx in my .log.

    JSONLint.com is pretty happy with the syntax - check http://pastebin.com/N84nwzgv.

    Even Vertica is happy with the syntax - see my earlier post: when I extract that line and a few around it, Vertica imports them without a problem.

    But now I noticed another line in the log:
    [EE] <INFO> WOS Full; failover to DIRECT [Txn 45035996275050501, Projection public.neighborhood_super]
    Adding 'direct' to the copy command didn't help, though - only the message disappeared.
  • Regarding the newlines - I understand that - but it would be nice if the user could explicitly set the record terminator to, say, '\n', with the benefit that the import does not fail completely on the first parsing error. That is often the required outcome: load as much as possible.

    On the other hand, I'm not sure that multiple objects following each other are valid JSON either, so there has to be some kind of format extension for multi-record JSON anyway - and this is where the newlines can have some meaning. Something like the sketch below would cover it.
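
    Purely hypothetical syntax for what I have in mind (no such parameter exists in the current release, as far as I know):

        copy flextable from '/data/records.json'
            parser fjsonparser(record_terminator=E'\n');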
  • Regarding the EE log entry, this just means that your load was big enough that Vertica switched it to load directly into the ROS rather than going through the WOS. Specifying DIRECT just tells Vertica not to bother trying the WOS first, hence no log message informing you that it switched over.

    Regarding that row, I'm able to load it without problem. What version of Vertica are you using? (as reported by: /opt/vertica/bin/vertica --version) I'd take a look at the row directly following it (and potentially the one immediately preceding it).

    Thanks very much for the loading feedback -- I like this pattern a lot and we'll definitely consider it. One concern I have is escaping: do you have any preferences for how we might deal with JSON values which need to contain newlines?
    Currently, we support two formats: 1. multiple individual JSON objects following each other, which, as you identified, is not valid JSON taken together even if the individual records are valid, for example {"A": 1}{"A": 2}{"A": 3}; 2. a single JSON array of JSON objects, which is valid JSON, for example [{"A": 1},{"A": 2},{"A": 3}]. Both are sketched below.
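
    For concreteness, a sketch of loading both supported formats (the file paths are hypothetical):

        -- /tmp/concat.json contains: {"A": 1}{"A": 2}{"A": 3}
        -- /tmp/array.json  contains: [{"A": 1},{"A": 2},{"A": 3}]
        copy flextable from '/tmp/concat.json' parser fjsonparser();
        copy flextable from '/tmp/array.json' parser fjsonparser();
        -- each statement loads three rows, each with key "A"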
  • Hi, excuse the delay, but ultimately I switched to the TSV loader, because there was a problem even with a very simple case where each JSON record had 3 values - Vertica happily loaded 60K out of 100K objects, with no error message.

    I can try to provide the data later.
