Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datetimes with all-zero time component (exact midnight) are detected as dates #35

Closed
SteadBytes opened this issue Aug 7, 2020 · 5 comments
Labels

Comments

@SteadBytes
Copy link

Datetime values with all-zero time components are detected as dates by process.detect_types. This is because typetools.is_datetime explicitly checks that the time component is not '00:00:00'

has_time = converted and the_time != NULL_TIME
. I'd argue that if a time value is present at all then the value should be treated as a datetime as this likely better represents the intent of the source. For example, a database report may include a datetime column but all the values in a particular output happened to occur at midnight. The down-casting behaviour is similar to that of floats #34.

Example Test Case

diff --git a/tests/test_process.py b/tests/test_process.py
index cc538a2..f90556d 100644
--- a/tests/test_process.py
+++ b/tests/test_process.py
@@ -76,6 +76,12 @@ class Test:
         nt.assert_equal(Decimal('0.87'), result['confidence'])
         nt.assert_false(result['accurate'])
 
+    def test_detect_types_datetimes_midnight(self):
+        records = it.repeat({"foo": "2000-01-01 00:00:00"})
+        records, result = pr.detect_types(records)
+
+        nt.assert_equal(result["types"], [{"id": "foo", "type": "datetime"}])
+
     def test_fillempty(self):
         records = [
             {'a': '1', 'b': '27', 'c': ''},

Fails with:

'AssertionError: Lists differ: [{'id': 'foo', 'type': 'date'}] != [{'id': 'foo', 'type': 'datetime'}]

First differing element 0:
{'id': 'foo', 'type': 'date'}
{'id': 'foo', 'type': 'datetime'}

- [{'id': 'foo', 'type': 'date'}]
+ [{'id': 'foo', 'type': 'datetime'}]
?                             ++++

Making the time part non-zero will pass the test.

Potential Solutions

  • Prefer stricter type inference for datetimes by default; e.g. if it has both a date and a time field, it's a datetime.
  • Allow stricter type inference for datetimes as an option e.g. a kwarg to detect_types that is passed down to is_datetime to change the behaviour from "can this only be parsed as a datetime" to "this is a datetime"
  • Any other ideas of course!

I am happy to implement the changes required after a decision is made on the correct behaviour 😄

@reubano
Copy link
Owner

reubano commented Aug 10, 2020

Thanks for this, it seems like the best solution would be a NULL_TIME marker that isn't actually a real time. A similar bug may exist with NULL_YEAR. I'd be happy to review a PR for this.

@SteadBytes
Copy link
Author

Correct me if I'm wrong, but the intention of the NULL_TIME marker is to deliberately 'downcast' a possible datetime to a date when the time component is, from a time equality perspective at least, insignificant e.g. 00:00:00 is exactly at the start of day/midnight so it could be considered equivalent to a date by itself.

If that is correct, then my understanding is that in this use case the 'downcasting' behaviour is undesirable and as such NULL_TIME/NULL_YEAR perhaps aren't applicable concepts (this is similar to #34 ).

Does that make sense? 😄

@reubano
Copy link
Owner

reubano commented Aug 11, 2020

Not really, it's there just to give a value to the_time. So I think a value of 99:99:99 should work (similar to how NULL_YEAR = 9999 and marks times which don't have a year).

@SteadBytes
Copy link
Author

Ah, I think I understand (we both meant the same thing but were describing it differently I think). In the cases you have linked, would it perhaps be simpler just to return False in those branches? This would eliminate the need to try and pick a suitable 'null' value and, IMO, expresses the intent more clearly. I'm happy to put together a PR - would a similar approach be suitable for the other issue I linked?

@reubano
Copy link
Owner

reubano commented Dec 24, 2021

Working on this now. A bit trickier than I thought. The issue is that regular dates can be parsed to have a time of 00:00:00. So if NULL_TIME is just False or None then all dates will be seen as datetimes.

@reubano reubano added the bug label Dec 25, 2021
reubano added a commit that referenced this issue Dec 25, 2021
* fixes:
  Bump to version 0.44.0
  [DOCS] Update doc blocks and doctests
  [CHANGE] Reorder by_field guess_funcs to match by_value
  [FIX] Properly detect midnight times (closes #35)
  Allow pypy3 to fail CI
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants