Working around common data analysis challenges
There are several common challenges that can turn into problems for even the most experienced data miners when preparing their data for reporting. It's a best practice to review the following list with each new data set that you get because it will save you time when joining tables, summarizing data, and conducting other data mining activities. With a little data "massaging" and awareness of some of these items (converting field types, extracting fields that are completely blank, etc.) you can actually "prepare" the data for the report:
- improperly imported data field formats;
- querying and calculating numbers that have been imported as text or other format;
- querying and calculating dates and ranges that have been imported as text or other format;
- thinking that just because the query executes without error that it has run correctly;
- not using control totals;
- trying to accomplish too much in a single query (in MS Access);
- dealing with large data sets;
- ensuring CSV files import correctly by removing commas from data fields;
- desired data is a portion of data in an existing field or is broken into multiple fields (such as MMDDYY);
- queries are too complicated or long to perform singularly;
- sharing database analysis results with other people;
- queries are too numerous to easily keep track of (MS Access);
- compacting databases to save disk space;
- making sure you have sufficient hardware for analyzing large data sets including:
- a speedy processor, sufficient RAM to enhance processing speed and stability; and
enough data storage capacity to store on the hard-drive disk.
- not spot checking query output with expected output from originating table;
- only pulling the data you need by segmenting your request (such as by cost center or specific organization identifier) and avoiding the desire to "have it all"; and
- splitting raw data files using a text editor (MS Word) or another database product (SQL Server or Idea).
Procurement fraud queries
Using the "Find Duplicates" functionality, identify cases where a vendor may have submitted the same invoice twice. Sorting and grouping and then changing fields used for sorting and group. (Group by vendor, then sort first by date, then by amount, etc.)
Examine the data for same vendor, same amount, and same date (duplicate information to allow for vendor submitting the same invoice but changing the invoice number).
Examine the data for same vendor, same amount, and same invoice number (duplicate identification to allow for vendor submitting same invoice but changing the date).
Examine the data for same vendor and same amount, which will result in the largest resultant data set.ATF and Split Purchases
Examine the data for split purchases (just under a cut-off dollar amount) and multiple transactions on the same day to the same vendor within a few dollars of each other.
For after-the-fact purchases compare the purchase order date and the invoice date. Create an expression that calculates the number of days between the two dates.
Expense report fraud queries using either tool
Import the list of expenditures that are within $1 of a key authorization threshold. (For example, if an original receipt is required for all employee expense over $30, pull all transactions that are $29.00 to $29.99.)
Group by employee to see if certain employees routinely expense items just under the threshold using the summary functionality (group by, count, and sum), and calculate the frequency and value of transactions just below the authorization threshold.
Investigate employees that have a significantly higher number of transactions just falling below a key authorization threshold to understand especially if the item descriptions are inadequate to confirm such as multiple entries for "Taxi."
Payroll fraud queries using either tool
Import the list of all employee receiving salary from the last period.
Import the list of all authorized, current employees:Using the "Find Unmatched" functionality in the Query Wizard, follow the steps to identify employees that are receiving salary but aren't on the list of authorized employees.
Annualize the salary payments of the last payroll cycle for all employees excluding expense reports reimbursement, bonus payments, etc.
Join to the personnel table on employee number.
Compare the annualized salary calculation with the authorized salary amount.
Investigate any discrepancies.
Queries for gaps in check register using either tool
Sort the transactions by check number field, showing the vendor paid, date of payment, and amount paid.
Look for sequences that are missing or check numbers that don't comply with standard numbering technique.
Queries for dormant account activity using either tool
Review activities by accounts that may not be actively used.
Date of the last transaction for each account can be determined by applying the Maximum function with the data entry date.
Determine if dates of transactions only seem to occur during a short time period and research the account purpose.