Building a Real-Time Streaming ETL Pipeline in 20 Minutes  ETL

7 years ago | comments | confluent.io | Score: 3 Reddit


Advanced ETL Processor Performance boost  ETL

7 years ago | 5 comments | etl-tools.com | Score: 2 Reddit

Latest Comment

We provide much more functionality compared to SSIS. Our customers are very loyal: we have users who switched jobs and still prefer to use our software instead of SSIS. We are cheaper than Informatica or Alteryx. It is difficult to compete with Pentaho, but we like the challenge.


7 years ago


Open Source Business Intelligence Tools  ETL

7 years ago | comments | blog.statsbot.co | Score: 5 Reddit


Best Top Informatica Interview Questions you should prepare  ETL

7 years ago | 1 comment | thetechdb.com | Score: Reddit

Latest Comment

It would be MUCH more useful if the answers to the questions were also provided.


7 years ago


DataStage Interview Questions for Experienced – Must Prepare – Updated  ETL

7 years ago | 1 comment | thetechdb.com | Score: 2 Reddit

Latest Comment

Looks like a good comprehensive set of questions.

What do you think about also providing the answers?


7 years ago


Top 20 Advanced Abinitio Interview Questions and Answers  ETL

7 years ago | comments | thetechdb.com | Score: Reddit


Release of Advanced ETL Processor Ent version 6  ETL

7 years ago | comments | etl-tools.com | Score: 2 Reddit


A new parallel tool that does spatial ETL  ETL

7 years ago | comments | self.ETL | Score: Reddit


Questions About Data Vault DWH Method (X-Post /R/BusinessIntelligence)  ETL

7 years ago | comments | self.ETL | Score: 3 Reddit


Data Model Design and Best Practices - Talend Blog  ETL

7 years ago | comments | talend.com | Score: 5 Reddit


Informatica Scenario based Interview Questions for Experienced  ETL

7 years ago | comments | thetechdb.com | Score: 1 Reddit


Best Top Informatica Interview Questions you should prepare  ETL

7 years ago | 1 comment | thetechdb.com | Score: 1 Reddit

Latest Comment

It'd be more helpful if the answers and explanation for the answers were also provided.


7 years ago


Top 5 Most Common Interview questions | Informatica  ETL

7 years ago | comments | thetechdb.com | Score: Reddit


How to find missing data in etl (especially in big data)?  ETL

7 years ago | 3 comments | self.ETL | Score: 3 Reddit

Latest Comment

Finding missing data typically depends a bit on the nature of the data, but the approach that I liked the most involved Statistical Process Control (SPC). With SPC you can pretty effectively identify when data, volumes of data, or distributions of data that are consistently seen are now missing. I've been able to use this on a financial data warehouse, and to stop the loads if the current set of data was inconsistent with its history.
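As a rough sketch of the SPC idea applied to load volumes (the row counts, the function name, and the 3-sigma threshold are assumptions for illustration, not something from the original comment), a load can be checked against control limits derived from its own history:

```python
from statistics import mean, stdev

def within_control_limits(history, current, sigmas=3.0):
    """Return True if `current` lies within mean +/- sigmas * stdev of `history`."""
    center = mean(history)
    spread = stdev(history)
    lower = center - sigmas * spread
    upper = center + sigmas * spread
    return lower <= current <= upper

# Hypothetical daily row counts from previous loads.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_090, 10_010]

typical_ok = within_control_limits(daily_row_counts, 10_070)  # a normal day
partial_ok = within_control_limits(daily_row_counts, 4_300)   # most rows missing
print(typical_ok, partial_ok)
```

A load whose count falls outside the limits could be stopped for review rather than published. The same check could be run per source, per partition, or on a distribution statistic instead of a raw count.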

As far as replacing missing data goes, it's most typically driven by teams with a more technical bent, often with statisticians or sophisticated analysts on the team. I think there's a few reasons for this:

  • Statistical analysis has historically focused on using subsets of data, and only recently with machine learning & big data has shifted to using entire data populations more commonly.
  • Statisticians feel the most comfortable with replacing missing data, and have the best methods & tools to handle it
  • Non-statisticians typically feel uncomfortable with replacing missing data for reasons like - their reporting isn't couched in terms of probability or confidence, but in counts of actual, tangible things. Which doesn't lend itself to inventing data.

From what I've seen, extrapolation, interpolation, and regression are the most common methods. However, I'm not an expert. I wouldn't use a tool, I'd write code to add the missing data. And it wouldn't matter if it was in a database, log file, csv or json file - because, well, that's why it's handy to write code. Most statistical packages have numerous methods to use (SAS, SPSS, R, Python's Pandas library, etc).

Finally, I suggest tagging data that has had such drastic alterations to it so that in the future you can analyze, filter, etc this invented data.
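To make the interpolate-and-tag suggestion concrete, here is a minimal sketch in plain Python. It assumes evenly spaced observations and uses None for a missing value. The function name and sample readings are made up for illustration.

```python
def fill_gaps_linear(values):
    """Fill None gaps by linear interpolation; return (value, was_imputed) pairs.

    Leading and trailing gaps stay None because there is nothing to
    interpolate between.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    result = [(v, False) for v in values]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            # Tag every invented value so it can be filtered out later.
            result[i] = (values[left] + step * (i - left), True)
    return result

readings = [10.0, None, None, 16.0, 18.0, None]
filled = fill_gaps_linear(readings)
for value, was_imputed in filled:
    print(value, "imputed" if was_imputed else "original")
```

Keeping the flag next to the value means downstream reports can still count only actual, tangible things when they need to.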

EDIT: fixed missing paragraph


7 years ago


Can you trigger an Alteryx job from a SSIS package?  ETL

7 years ago | 3 comments | self.ETL | Score: 2 Reddit

Latest Comment

You can build flows as "macros" which you can then unit test completely independently of your source/destination databases (try that in SSIS..)

I believe you can do this in SSIS using project parameters and environment variables, specifically if you are calling the projects via a SQL Server job. You can use it for source/target connections, as well as values to pass into SQL clauses. More information here.

I'm not an SSIS expert, but I believe you can do many of the things in that list using SSIS. I'm not sure how the browse tool works in Alteryx, but it sounds like the data monitor in SSIS, for instance.


7 years ago


Trying to switch from hardware QA to ETL QA, need guidance (xpost r/cscareerquestions)  ETL

7 years ago | 2 comments | self.ETL | Score: 3 Reddit

Latest Comment

Also posted on /r/database


7 years ago


Versioning for Database Objects  ETL

7 years ago | 5 comments | self.ETL | Score: 8 Reddit

Latest Comment

So this tool is really being built from a developer's perspective. Initially, the first release will contain simple diagnostics using data that should be available from something like information_schema (if you have access to an object, you should have access to the object's data in information_schema). DBAs have other tools at their disposal that are generally RDBMS-specific with tons of bells and whistles which basically devs just don't need.


7 years ago


Machine learning in Talend - Decision Trees  ETL

7 years ago | 1 comment | help.talend.com | Score: 2 Reddit


ETL Tool for Oracle cloud to MS SQL  ETL

7 years ago | 9 comments | self.ETL | Score: 2 Reddit

Latest Comment

Have you considered writing a custom tool in Go or Python? You can load a table extremely quickly and easily with Python.
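A minimal sketch of what such a custom loader might look like in Python. It uses the standard-library sqlite3 module and an in-memory CSV purely as stand-ins so the example runs anywhere. A real Oracle-to-SQL-Server job would use the drivers for those databases instead, and the table and column names here are invented.

```python
import csv
import io
import sqlite3

def load_csv(conn, table, csv_file):
    """Bulk-insert rows from a CSV file object with a header row; return the row count.

    The table and column names are interpolated into the SQL text, so they
    must come from a trusted source.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    rows = list(reader)
    conn.executemany(
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders})", rows
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
source = io.StringIO("id,name\n1,Ada\n2,Grace\n")
loaded = load_csv(conn, "customers", source)
print(loaded, conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
```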


7 years ago


looking for a best practices in etl page independent of platform or language.  ETL

7 years ago | 5 comments | self.ETL | Score: 1 Reddit

Latest Comment

Aside from Kimball's books, there isn't much. You can read some thoughts on the topic in this thread. Bill Inmon did actual ETL developers a huge disservice when he pushed people away from using custom code and towards expensive proprietary ETL suites. As a technology, ETL feels so far behind other areas because of the route it's taken.


7 years ago


Suggestions for good resources to find ETL/BI projects  ETL

7 years ago | 1 comment | self.ETL | Score: 3 Reddit

Latest Comment

Look for companies that are experiencing some change due to strong growth or other factors which will result in new processes, a need for better insights and the like:

  • early-stage start-ups, that have a good idea, maybe a few ad-hoc reports but no time or resources to build up their analytics environment

  • companies that just had a funding round and can start new projects as well as spend money on new employees and contractors

  • companies that just went through a merger and need to consolidate their systems as fast as possible


7 years ago


Using Unicode for foreign languages - Any advice  ETL

7 years ago | 2 comments | self.ETL | Score: 2 Reddit

Latest Comment

Kanji does not have upper and lower case. They're not letters.

If I use a French keyboard and write a name, but another time use the same letters on a Russian keyboard, will a lookup find the connection?

I don't expect it to as the strings are not the same.

I did some stuff in Korean. Best advice I can offer: be sure to set everything to NVARCHAR/Unicode.

I'd suggest adding a field with a language key in it. Especially if you're concerned about functions affecting different languages differently.
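A small Python illustration of the points above, as a sketch rather than a full treatment: look-alike letters from different scripts are different code points, so an exact-match lookup will not connect them, and case functions leave Kanji unchanged. The sample strings are chosen for illustration only.

```python
import unicodedata

latin = "a"          # U+0061 LATIN SMALL LETTER A
cyrillic = "\u0430"  # U+0430 CYRILLIC SMALL LETTER A, looks the same on screen

# Same glyph on screen, different strings: an exact-match lookup will miss.
print(latin == cyrillic)
print(unicodedata.name(latin), "|", unicodedata.name(cyrillic))

# Normalization solves a different problem: composed vs. decomposed accents.
composed = "\u00e9"     # e with acute accent as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)

# Kanji has no upper or lower case, so case functions return it unchanged.
kanji = "\u6f22\u5b57"
print(kanji.upper() == kanji)
```

Normalizing to one form (NFC here) before storing or comparing handles the accent case. Cross-script look-alikes are not unified by normalization, which is one reason a language key column can be useful.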


7 years ago


Overview of input fields on Talend  ETL

7 years ago | 1 comment | self.ETL | Score: 2 Reddit

Latest Comment

I'm afraid I don't really understand your question but have you looked at the Data Quality perspective/studio?


7 years ago


[Discussion] Web application ETL design  ETL

7 years ago | 8 comments | self.ETL | Score: 5 Reddit

Latest Comment

I would use what I call the modular approach. The idea is to split the task into different database parts (schemas).

One module would be to store raw data, just a dump.

The second would be an archive module, where previous data are merged with the new data to find the diffs. If the size is not that big (< some GB on cheap hardware, more than 100GB on current hardware), I would make a diff (an md5 hash can be of huge help here) and not use timestamps and/or sequential PKs. Never trust the source system if you don't have to.
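A minimal sketch of the hash-based diff, assuming each snapshot fits in memory as a dict keyed by primary key. The function names and sample rows are hypothetical, and in a database the same idea would usually be expressed in SQL.

```python
import hashlib

def row_hash(row):
    """MD5 over the row's values, joined with a separator unlikely to occur in data.

    Simplification: None and the empty string hash to the same value here.
    """
    joined = "\x1f".join("" if v is None else str(v) for v in row)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def diff_snapshots(previous, current):
    """Compare two {key: row} snapshots; return (inserted, updated, deleted) keys."""
    prev_hashes = {k: row_hash(r) for k, r in previous.items()}
    curr_hashes = {k: row_hash(r) for k, r in current.items()}
    inserted = sorted(k for k in curr_hashes if k not in prev_hashes)
    deleted = sorted(k for k in prev_hashes if k not in curr_hashes)
    updated = sorted(
        k for k in curr_hashes
        if k in prev_hashes and curr_hashes[k] != prev_hashes[k]
    )
    return inserted, updated, deleted

previous = {1: ("Ada", "London"), 2: ("Grace", "NYC"), 3: ("Linus", "Helsinki")}
current = {1: ("Ada", "London"), 2: ("Grace", "Arlington"), 4: ("Guido", "Amsterdam")}
inserted, updated, deleted = diff_snapshots(previous, current)
print(inserted, updated, deleted)
```

Because the comparison looks only at the data itself, it does not rely on the source system maintaining timestamps or sequential keys correctly.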

Another module would be to prepare the data for reporting (on top of the second). A classic star schema or some precomputed view (or classic views). A fresh reload every time. It should be fast so no need to complicate things.

And finally, one last module for QA. You would log every data quality issue here.

With that, you have a fairly good audit trail if needed and atomicity of a database.

You can find more information on the modular approach here.

PostgreSQL can be really fast, but don't do insert/update/delete when you don't have to. Use "create table as" and bulk insert (pgdump in Python doesn't support that directly). It's not as fast as an MPP database (far from it) but still.

Something really ELT then.

My experience leads me to disagree with kenfar, but that's probably just because we are each good in our own way.


7 years ago


capitalone funded alternative to pentaho clover and talend?  ETL

7 years ago | 3 comments | github.com | Score: 5 Reddit

Latest Comment

hope it survives!


7 years ago


“I choose you!” - criteria for selecting a data warehouse platform  ETL

7 years ago | 1 comment | blog.panoply.io | Score: 5 Reddit

Latest Comment

I'm actually pretty dissatisfied with the redshift instance options: you get either SSDs, which are great for concurrency, or rotating media, which is great for large volumes. But you can't get both, and you can't use EBS or S3.

Which means that if you want to support a lot of concurrency but also have large data volumes your cost can go through the roof. Which is unfortunate, and is absolutely way behind where commercial databases have been for the past twenty years. It also means that supposed cost advantages over a locally-hosted database server can evaporate pretty quickly.

Anyhow, would also like to hear about cost & performance comparisons of Redshift against Snowflake - when snowflake is configured to grow & shrink cluster sizes dynamically so that they're minimally-sized at night and on weekends, then sized up when needed. The last comparison I looked at didn't include this in the snowflake configuration - which is as bad as comparing bigquery against redshift without setting up sort & dist keys on the redshift tables.


7 years ago


Embarrassing questions, but I'm trying to classify my new job. Am I in the ETL world now?  ETL

7 years ago | 2 comments | self.ETL | Score: 3 Reddit

Latest Comment

These definitions tend to shift over time and overlap:

  • Integration solutions are often more involved with connecting transactional systems through service layers.
  • ETL solutions are traditionally involved with changed data capture, heavier transformations, aggregation and potentially complex pipelines. And in the big-data/data-science space it may involve a vast number of jobs, running complex analysis, and managing the execution dependencies between them.

But they overlap in the middle. And some solution providers, like these, appear to be trying to cover both.


7 years ago


ETL vs ELT: the difference is in the how  ETL

7 years ago | comments | panoply.io | Score: 1 Reddit


F# for Data Engineering • r/dataengineering  ETL

7 years ago | comments | reddit.com | Score: 1 Reddit


Which niches or vendor tools in the ETL world are more resource-constrained, where it's presently harder to fill open positions? (I could perhaps contribute to filling a gap :) )  ETL

7 years ago | 4 comments | self.ETL | Score: 4 Reddit

Latest Comment

CloverETL - easy to learn and growing community, we started using it a while ago and it's pretty nice - I see them being pretty large in the next few years, check em out.


7 years ago


Which niche areas or tools within ETL world are more resource-constrained, or having a harder time filling open positions?  ETL

7 years ago | comments | self.ETL | Score: 1 Reddit


11 Great ETL Tools, and the Case for Saying “No” to ETL  ETL

7 years ago | comments | panoply.io | Score: 4 Reddit


DataStage - Comparing Two Datasets  ETL

7 years ago | 1 comment | self.ETL | Score: 1 Reddit

Latest Comment

The only tool that automates a part of this that I am familiar with is Cozyroc's Table Diff SSIS component, but it's expensive, requires SSIS, and isn't very customizable. It's pretty quick to set up but is not completely dynamic: you still have to let it know the key(s), and what to look for.

I don't know any way of performing this dynamically, but I know how to do it pretty quickly. If I can get the table definitions I copy them into Excel, transpose them and format them into pairs:

select
tbl1.cola, tbl2.cola,
tbl1.colb, tbl2.colb,...
from tbl1
full outer join tbl2 --outer to capture missing records on either side
    on tbl1.pk = tbl2.pk

Then I can run through the current record in pairs, stepping through the set by two, but grabbing attributes i and i+1. Then I run them through a few possibilities: i != i+1, i is NULL i+1 is not NULL, i is not NULL i+1 is NULL, etc.. For any caught, store the records/values in another table. This part can be reused, but the query has to be written each time.

I do the same thing to store deltas and history. I store the key, field name, source value (new), target value (old), the delta (if date or numeric), and the current datetime so I trigger events when specific attributes change, or change to a specific value.
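The pairwise stepping described above could look roughly like this in Python. It is a sketch only: the function name, the tuple layout, and the sample row are assumptions, with None standing in for SQL NULL.

```python
def find_mismatches(key, row, column_names):
    """Step through a joined row two values at a time and report pairs that differ.

    `row` holds the two tables' values side by side:
    tbl1.cola, tbl2.cola, tbl1.colb, tbl2.colb, and so on.
    None vs. a value counts as a mismatch; None vs. None counts as equal.
    """
    mismatches = []
    for i in range(0, len(row), 2):
        first_value, second_value = row[i], row[i + 1]
        if first_value != second_value:
            mismatches.append((key, column_names[i // 2], first_value, second_value))
    return mismatches

# One hypothetical row from the full outer join, for primary key 42.
joined_row = ("x", "x", 5, 7, None, "z")
found = find_mismatches(42, joined_row, ["cola", "colb", "colc"])
for record in found:
    print(record)
```

Each reported tuple carries the key, the field name, and both values, which matches the shape needed for a delta or history table. A timestamp and a numeric delta could be added at insert time.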


7 years ago


[HIRING] Informatica Developer - 3 month Contract-to-hire - Orange County, CA or Atlanta GA  ETL

7 years ago | comments | self.ETL | Score: 4 Reddit


How to do ETL to Elasticsearch in Node  ETL

7 years ago | 1 comment | lessrework.com | Score: 1 Reddit

Latest Comment

Is this Aaron's new blog? We're looking to do a modernization effort on a fairly complicated form application. Sounds good that node and elasticstack work well together.


7 years ago


IPC for ETL components  ETL

7 years ago | 2 comments | self.ETL | Score: 2 Reddit


How good is the job market right now for Microsoft-specialized BI/ETL Developer? (SSIS, SSRS)  ETL

7 years ago | 15 comments | self.ETL | Score: 2 Reddit

Latest Comment

I'm looking around and I'd say the job market is decent depending on where you are/how mobile you are (though a lot of places are looking for 'bigdata/nosql' experience as well). Though I would think the job market in .NET devs is good (and for statisticians!), so I'm assuming this is out of interest instead of career optimization.

As someone also middle-aged: age is really only a potential issue in getting into tech firms, especially more cutting edge ones (who want to work you 60 hours a week). Frankly, those sorts of places are generally not using MS stack. The sort of places that are hiring SSIS/SSRS devs are usually more corporate tech departments (i.e. of retail orgs or logistic orgs, etc.), and age is not an issue for those.


7 years ago


1000+ CSV formatted files to be loaded.  ETL

7 years ago | 20 comments | self.ETL | Score: 3 Reddit

Latest Comment

Odo - Moves data across containers (SQL, CSV, MongoDB, Pandas, etc). Claims to be the easiest and fastest way to load a CSV into your database.

https://github.com/blaze/odo

Documentation: http://odo.pydata.org/en/latest/overview.html

Also: https://github.com/pawl/awesome-etl


7 years ago


performance issues with ETL in vertica  ETL

7 years ago | 6 comments | self.ETL | Score: 2 Reddit

Latest Comment

I have used Vertica and it comes with its own data loader. If you use that, it's bloody fast. We tried to use Sqoop to load data into Vertica and that was very slow. But with its native data loader, the data load was blazing fast.


7 years ago


Talend Online Training  ETL

7 years ago | 1 comment | mindmajix.com | Score: 4 Reddit


Pentaho BI Online Training | Online Pentaho BI Certification Course in USA, UK, Canada, Australia, Dubai, India  ETL

7 years ago | 2 comments | a1trainings.com | Score: Reddit


SSIS vs T SQL  ETL

7 years ago | 14 comments | self.ETL | Score: 3 Reddit

Latest Comment

Many people posted good comments. I would add that it depends on how much "analytic" work you need to do with the data. SSIS is great at scheduling and very simple transformations but is limited in a more analytical (data science) approach. At my company, we use SSIS for big overnight jobs and Alteryx for data science related ETL.


7 years ago


Workflow management vs ETL 'Suite'  ETL

7 years ago | 8 comments | self.ETL | Score: 3 Reddit

Latest Comment

Hey, thanks for the replies! I guess this is where things get tricky. Up to now SSIS served up data 'as-is' meaning we provided the relevant relational data tables that an application required and left it to the business layer to deal with querying it anyway it may wish. But now, since we're working with Mongo we have a much stronger need to implement business logic within the ETL itself which should reside within the dev team and not be the DevOps or DBA's responsibility.

Currently we're using both a nodejs pipelines app and a .NET console app to perform the data load and extract from mongo. I guess, having looked at AirFlow a bit more, it would've been a better solution to go with individual linked tasks instead of two separate full-blown applications.. i wonder if my teammates know python.. :P


7 years ago


Advanced ETL Processor Enterprise  ETL

7 years ago | 3 comments | etl-tools.com | Score: Reddit

Latest Comment

bombastic support


7 years ago


What is your data lineage strategy and tool(s) of choice?  ETL

7 years ago | 7 comments | self.ETL | Score: 2 Reddit

Latest Comment

Also one thing you should definitely consider: Is maintaining the lineage manually sustainable in the long run? The answer, considering any DWH except a toy one, tends to be an overwhelming NO. Note that I work for a vendor that automates data lineage from custom code so I may be biased, just so slightly :) But do give it a thought, you will thank yourself later. I am assuming here that when you say "home grown" you mean manual, which is the only sane explanation anyway ;-)


7 years ago


Talend Training and Certification Course Online in USA|UK|Canada|Australia|Dubai|India  ETL

7 years ago | 4 comments | a1trainings.com | Score: Reddit


What cloud ETL tools are you using?  ETL

7 years ago | 15 comments | self.ETL | Score: 5 Reddit

Latest Comment

It looks like you've received some good recommendations here and I just want to suggest that you take a look at the peer reviews over at IT Central Station to see what people who have used these tools have to say about them. For example, in a review of WhereScape RED (which is ranked by ITCS users as one of the top Data Integration and Access Tools) one Data Warehouse Manager for whom easy documentation was really important commented that "We have always struggled with maintaining documentation on our ETL. It is very nice to get source-to-target mappings, table diagrams, and dependency diagrams, which are up to date, with the click of a button." (you can see the rest of this review here if you want.) So it's just really helpful to look at real-user reviews to get a good idea of what it's like to use these tools.

Here's a list of popular cloud data tools with reviews, which includes Matillion, Informatica Cloud and some of the other tools suggested here.

Hope this helps.


7 years ago


Idea for an ETL task  ETL

7 years ago | 9 comments | self.ETL | Score: 2 Reddit

Latest Comment

Performance depends on a number of factors, and there are physical limitations you will not be able to overcome.

Imagine that you have 100m records in DB2 source table.

So you run: insert into sqlserver table select * from db2table

Performance depends on the DB2 hard disk, network bandwidth, and the SQL Server hard disk.

It does not matter which tool you use; you will not be able to make it any faster.

The way to make it faster is to transfer only the data that was modified or added since the last transfer. That is only possible if the source table has a last-modification-date field.
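A minimal sketch of that incremental transfer, using two in-memory SQLite databases as stand-ins for DB2 and SQL Server so the example is runnable. The `orders` table, its columns, and the function name are invented for illustration, and `INSERT OR REPLACE` is SQLite syntax, so another database would need its own upsert statement.

```python
import sqlite3

def copy_changed_rows(source, target, last_transfer):
    """Copy rows modified after `last_transfer`; return the new high-water mark."""
    rows = source.execute(
        "SELECT id, payload, modified_at FROM orders WHERE modified_at > ?",
        (last_transfer,),
    ).fetchall()
    target.executemany(
        "INSERT OR REPLACE INTO orders (id, payload, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    # If nothing changed, keep the previous mark.
    return max((r[2] for r in rows), default=last_transfer)

ddl = "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)"
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(ddl)
target.execute(ddl)

# ISO-formatted dates compare correctly as plain strings.
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "a", "2017-01-01"), (2, "b", "2017-01-05"), (3, "c", "2017-01-09")],
)
target.execute("INSERT INTO orders VALUES (1, 'a', '2017-01-01')")

watermark = copy_changed_rows(source, target, "2017-01-01")
print(watermark, target.execute("SELECT COUNT(*) FROM orders").fetchone()[0])
```

Only two of the three source rows cross the wire here. The returned mark would be stored somewhere durable and passed in on the next run. Note that this approach does not detect rows deleted at the source.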

Mike http://www.etl-tools.com/index.php


7 years ago


CloverETL?  ETL

7 years ago | 2 comments | self.ETL | Score: 4 Reddit

Latest Comment

Used it for a while, really did not like it. That said, we're on an in house, Python system using Airflow now because we prefer code + git over GUI + central ETL server.


7 years ago


Crab SQL for the filesystem, now for macOS & Windows (x-post from r/SQL)  ETL

7 years ago | 3 comments | etia.co.uk | Score: 6 Reddit

Latest Comment

DEUS VULT! hello


7 years ago