1 vote · 1 answer · 45 views

I'm working with a time-series dataset where each record is supposed to be logged at 1-minute intervals. However, due to data quality issues, the dataset contains: duplicated timestamps, missing ...
asked by Kinjal Radadiya
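The time-series check described above (finding duplicated timestamps and missing 1-minute slots) can be sketched with the standard library alone; `audit_minutely` is a hypothetical helper name, and real data would more likely live in a pandas DataFrame (`drop_duplicates` on the index plus `asfreq('1min')`).

```python
from datetime import datetime, timedelta

def audit_minutely(timestamps):
    """Report duplicated timestamps and missing 1-minute slots
    in a sequence of datetimes that should be strictly minutely."""
    seen = set()
    dupes = []
    for ts in timestamps:
        if ts in seen:
            dupes.append(ts)
        seen.add(ts)
    # Build the expected 1-minute grid from first to last observed minute
    start, end = min(seen), max(seen)
    expected = set()
    t = start
    while t <= end:
        expected.add(t)
        t += timedelta(minutes=1)
    missing = sorted(expected - seen)
    return dupes, missing

# 00:01 is duplicated, 00:02 is missing
ts = [datetime(2024, 1, 1, 0, m) for m in (0, 1, 1, 3)]
dupes, missing = audit_minutely(ts)
```

The same two outputs (duplicate list, gap list) are what one would feed into a dedupe-then-reindex repair step.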
0 votes · 0 answers · 34 views

I'm currently trying to implement PyDeequ to identify anomalies in volumes for specific time periods; the problem is that PyDeequ is picking up the latest entry from the metrics repository instead ...
asked by polo1211
0 votes · 0 answers · 47 views

I'm working with the Yahoo! Webscope dataset ydata-frontpage-todaymodule-clicks-v1_0 (specifically, the click logs for the first ten days in May 2009). The dataset description states that each user ...
asked by amarchin
0 votes · 1 answer · 367 views

According to this documentation page, AWS Glue can now detect rows that failed a CustomSql data quality check. I tried it, but I am not seeing the rows that failed, only a percentage of failed data. Here is ...
asked by Haha
-1 votes · 1 answer · 51 views

Rows 4 & 5 have the same value in Col C and also the same value in Col D (correct). Rows 6 & 7 have the same value in Col C but different values in Col D (incorrect). So all unique combinations in Cols A & B ...
asked by user3585510
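The duplicate-combination rule above (rows sharing a Col C value must also share a Col D value) amounts to checking a functional dependency C → D. A minimal pure-Python sketch, with the hypothetical helper name `inconsistent_keys`:

```python
def inconsistent_keys(rows, key="C", val="D"):
    """Return key values that map to more than one distinct value,
    i.e. violations of the expected key -> value dependency."""
    seen = {}
    bad = set()
    for r in rows:
        k, v = r[key], r[val]
        if k in seen and seen[k] != v:
            bad.add(k)
        seen.setdefault(k, v)
    return sorted(bad)

rows = [
    {"C": "x", "D": 1},  # like rows 4 & 5: same C, same D -> correct
    {"C": "x", "D": 1},
    {"C": "y", "D": 2},  # like rows 6 & 7: same C, different D -> incorrect
    {"C": "y", "D": 3},
]
```

In SQL the equivalent would be a `GROUP BY C HAVING COUNT(DISTINCT D) > 1`.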
0 votes · 1 answer · 97 views

Currently I have a table that saves data quality results (using Dataplex); this table gives me a query to see the data that does not meet the quality rule. Example: in order to know which ...
asked by Bastián SN
1 vote · 1 answer · 110 views

I am working on a DWH, doing an incremental load into staging from the application database, then doing quality checks and loading the data into the reporting schema with rows flagged 0/1 (for errors) using ...
asked by Pritesh singh
0 votes · 1 answer · 643 views

I'm developing a solution to run a data quality check on one column, and have already used the rule expect_column_values_to_be_unique on many other columns, like the following: df....
asked by Lucas Mengual
1 vote · 2 answers · 319 views

I have some data in SAS that I am performing QA on. I know I can output data to different tables using IF statements etc. What I want to do is output data to a table called 'error_data' if it fails a ...
asked by Sproodle
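The SAS question above wants failing rows routed to an 'error_data' table (in SAS this would typically be an `OUTPUT error_data;` inside the IF block of a DATA step). A hedged Python sketch of the same routing logic, with `route` and the example checks as hypothetical names:

```python
def route(records, checks):
    """Split records into clean and error sets; a record failing any
    check goes to error_data along with the names of the failed rules."""
    clean, error_data = [], []
    for rec in records:
        failed = [name for name, ok in checks.items() if not ok(rec)]
        if failed:
            error_data.append({**rec, "failed_checks": failed})
        else:
            clean.append(rec)
    return clean, error_data

checks = {
    "age_positive": lambda r: r["age"] > 0,
    "name_present": lambda r: bool(r["name"]),
}
records = [{"name": "Ann", "age": 34}, {"name": "", "age": -1}]
clean, errors = route(records, checks)
```

Recording *which* rule failed, not just a pass/fail flag, makes the error table far more useful for triage.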
0 votes · 1 answer · 115 views

REDCap returns two associated discrepancies for the same rule. One shows that the values involved have no complete data ([no_data]) and the second one returns the case with the discrepancy that matches ...
asked by user_pir
0 votes · 1 answer · 825 views

I need to come up with data quality metrics for a project and decide how to measure them. I've been googling and reading, and I understood that you can 'measure' the quality of data using the 6 dimensions (...
asked by Alex
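For the six-dimensions question above, each dimension ultimately reduces to a measurable ratio. As one illustration, completeness can be computed as the share of non-missing values; `completeness` is a hypothetical helper, and what counts as "missing" (here `None` or the empty string) is an assumption:

```python
def completeness(rows, column):
    """Completeness: fraction of records with a non-missing value
    in the given column (one of the six classic DQ dimensions)."""
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)

rows = [{"email": "a@x.com"}, {"email": ""}, {"email": "b@x.com"}, {}]
```

Uniqueness, validity, and the other dimensions can be expressed as similar ratios (distinct count over total, regex matches over total, and so on).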
2 votes · 1 answer · 2k views

Regarding Great Expectations: I want to create a custom expectation to validate whether there are multiple unique observations of id_client for a given id_product key in a DataFrame. After setting up my ...
asked by PeCaDe
3 votes · 3 answers · 862 views

I have a CSV file with 8 columns. Within the columns I purposely deleted some cells. When I tried to run a Glue Data Quality job, the IsComplete result passed (which it is not supposed to) for one ...
asked by khorjle
1 vote · 1 answer · 262 views

I have various .csv files. Each file has multiple columns. I am using the given code in R to run a quality check that, for a particular column, counts how many rows have valid values and how many are null. ...
asked by Michael_Brun
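The R question above (per-column counts of valid vs. null values across CSV files) translates directly to other languages. A stdlib Python sketch using the csv module; the set of null tokens and the helper name `column_fill_counts` are assumptions:

```python
import csv
import io

def column_fill_counts(csv_text, null_tokens=("", "NA", "NULL")):
    """For each column, count how many rows hold a valid value
    and how many hold a null token."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: {"valid": 0, "null": 0} for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            bucket = "null" if value in null_tokens else "valid"
            counts[name][bucket] += 1
    return counts

data = "a,b\n1,NA\n2,3\n,4\n"
counts = column_fill_counts(data)
```

For many files, the same function can be mapped over a directory listing and the per-file results collected into one report.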
1 vote · 0 answers · 636 views

I have a regular expression which works perfectly fine in the Sheet view in Abinitio ExpressIT, but I am trying to do the same in the Rules Grid / Grid view, and I am not sure which function I can ...
asked by JKC
1 vote · 1 answer · 1k views

I have implemented a data pipeline using Auto Loader, bronze --> silver --> gold. Now, while doing this, I want to perform some data quality checks, and for that I'm using the Great Expectations library. ...
asked by Chhaya Vishwakarma
0 votes · 2 answers · 417 views

I'm trying to create a sheet to check the data quality from a survey in Google Sheets. The document has this format: So basically I was using the formula =COUNTIF(B2:F2,"Don't know") to ...
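The COUNTIF formula above tallies "Don't know" answers across each response row. The same tally in Python, useful for cross-checking the Sheets output; `count_per_row` is a hypothetical name:

```python
def count_per_row(rows, target="Don't know"):
    """Equivalent of =COUNTIF(B2:F2, "Don't know") applied to each row."""
    return [sum(1 for cell in row if cell == target) for row in rows]

survey = [
    ["Yes", "Don't know", "No", "Don't know", "Yes"],
    ["No", "No", "Yes", "Yes", "No"],
]
```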
0 votes · 0 answers · 75 views

Looking for the most efficient way to check for nulls and produce a desired output for a report. This is done in a Hadoop environment. For example, the database contains: FirstName LastName State John {null} ...
asked by Supernova
1 vote · 1 answer · 918 views

I am trying to run a Great Expectations suite on a Delta table in Databricks, but I want to run this on part of the table via a query. Though the validation is running fine, it's running on the full ...
asked by S.Dasgupta
2 votes · 1 answer · 418 views

I'm trying to use PyDeequ in a Jupyter Notebook; when I try to use ConstraintSuggestionRunner it shows this error: Py4JJavaError: An error occurred while calling o70.run. : java.lang.NoSuchMethodError: '...
asked by LuisRicardo
1 vote · 0 answers · 562 views

Great Expectations creates temporary tables. I tried profiling data in my Snowflake lab. It worked because the role I was using could create tables in the schema that contained the tables I was ...
asked by Alex Woolford
1 vote · 1 answer · 924 views

I am implementing data quality checks using the Great Expectations library. Is this library compatible with PySpark, and does it run on multiple cores?
asked by code_bug
0 votes · 1 answer · 883 views

I have a table with 60+ columns that I would like to UNPIVOT so that each column becomes a row, and then find the fill rate, min value and max value of each entry. For example: ID START_DATE ...
asked by user18623003
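The unpivot-and-profile idea above (one summary row per original column, with fill rate, min and max) can be sketched in pure Python; in SQL this would be an UNPIVOT followed by grouped aggregates, and `column_profile` is a hypothetical name:

```python
def column_profile(rows):
    """Unpivot a wide table: return one summary entry per column with
    fill rate, min and max computed over the non-null values."""
    columns = rows[0].keys()
    profile = {}
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        profile[col] = {
            "fill_rate": len(values) / len(rows),
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return profile

rows = [
    {"ID": 1, "SCORE": 10},
    {"ID": 2, "SCORE": None},
    {"ID": 3, "SCORE": 30},
]
profile = column_profile(rows)
```

With 60+ columns the loop shape stays the same; only the column list grows.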
0 votes · 1 answer · 97 views

In Talend Data Quality, I have configured a JDBC connection to an OpenEdge database and it's working fine. I can pull the list of tables and select columns to analyse, but when executing the analysis, I ...
asked by Sergei K.
0 votes · 0 answers · 65 views

I've not had to do any heavy lifting with Pandas until now, and now I've got a bit of a situation and could use some guidance. I've got some code that generates the following dataframe: ID_x HOST_NM ...
asked by Magneto Optical