Building a Real-Time Streaming ETL Pipeline in 20 Minutes ETL
7 years ago | comments |
confluent.io | Score: 3
Reddit
Advanced ETL Processor Performance boost ETL
7 years ago | 5 comments |
etl-tools.com | Score: 2
Reddit
Latest Comment
We provide much more functionality than SSIS.
Our customers are very loyal:
we have users who switched jobs and still prefer to use our software instead of SSIS.
We are cheaper than Informatica or Alteryx.
It is difficult to compete with Pentaho, but we like the challenge.
7 years ago
Open Source Business Intelligence Tools ETL
7 years ago | comments |
blog.statsbot.co | Score: 5
Reddit
Best Top Informatica Interview Questions you should prepare ETL
7 years ago | 1 comment |
thetechdb.com | Score:
Reddit
Latest Comment
It would be MUCH more useful if the answers to the questions were also provided.
7 years ago
DataStage Interview Questions for Experienced – Must Prepare – Updated ETL
7 years ago | 1 comment |
thetechdb.com | Score: 2
Reddit
Latest Comment
Looks like a good comprehensive set of questions.
What do you think about also providing the answers?
7 years ago
Top 20 Advanced Abinitio Interview Questions and Answers ETL
7 years ago | comments |
thetechdb.com | Score:
Reddit
Release of Advanced ETL Processor Ent version 6 ETL
7 years ago | comments |
etl-tools.com | Score: 2
Reddit
A new parallel tool that does spatial ETL ETL
7 years ago | comments |
self.ETL | Score:
Reddit
Questions About Data Vault DWH Method (X-Post /R/BusinessIntelligence) ETL
7 years ago | comments |
self.ETL | Score: 3
Reddit
Data Model Design and Best Practices - Talend Blog ETL
7 years ago | comments |
talend.com | Score: 5
Reddit
Informatica Scenario based Interview Questions for Experienced ETL
7 years ago | comments |
thetechdb.com | Score: 1
Reddit
Best Top Informatica Interview Questions you should prepare ETL
7 years ago | 1 comment |
thetechdb.com | Score: 1
Reddit
Latest Comment
It'd be more helpful if the answers and explanation for the answers were also provided.
7 years ago
Top 5 Most Common Interview questions | Informatica ETL
7 years ago | comments |
thetechdb.com | Score:
Reddit
How to find missing data in etl (especially in big data)? ETL
7 years ago | 3 comments |
self.ETL | Score: 3
Reddit
Latest Comment
Finding missing data typically depends a bit on the nature of the data, but the approach that I liked the most involved Statistical Process Control (SPC). With SPC you can pretty effectively identify when data, volumes of data, or distributions of data that are consistently seen have gone missing. I've been able to use this on a financial data warehouse, stopping the loads if the current set of data was inconsistent with its history.
As far as replacing missing data goes, it's most typically driven by teams with a more technical bent, often with statisticians or sophisticated analysts on the team. I think there are a few reasons for this:
- Statistical analysis has historically focused on using subsets of data, and only recently with machine learning & big data has shifted to using entire data populations more commonly.
- Statisticians feel the most comfortable with replacing missing data, and have the best methods & tools to handle it
- Non-statisticians typically feel uncomfortable with replacing missing data because their reporting isn't couched in terms of probability or confidence, but in counts of actual, tangible things, which doesn't lend itself to inventing data.
From what I've seen, extrapolation, interpolation, and regression are the most common methods. However, I'm not an expert. I wouldn't use a tool; I'd write code to add the missing data. And it wouldn't matter if it was in a database, log file, CSV or JSON file, because, well, that's why it's handy to write code. Most statistical packages offer numerous methods (SAS, SPSS, R, Python's pandas library, etc.).
Finally, I suggest tagging data that has had such drastic alterations made to it, so that in the future you can analyze, filter, etc. this invented data.
EDIT: fixed missing paragraph
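A minimal sketch of the SPC idea described in this comment, not the commenter's actual implementation: it derives three-sigma control limits from past load volumes and flags a load that falls outside them. The function names, the sample row counts, and the three-sigma threshold are all illustrative assumptions.

```python
from statistics import mean, stdev


def control_limits(history, sigmas=3.0):
    """Return (lower, upper) control limits computed from past load volumes."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation of the history
    return center - sigmas * spread, center + sigmas * spread


def is_in_control(history, current, sigmas=3.0):
    """True when the current load volume falls inside the control limits."""
    lower, upper = control_limits(history, sigmas)
    return lower <= current <= upper


# Hypothetical daily row counts from previous loads (illustrative data only).
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_890, 10_075, 10_140]

for todays_count in (10_100, 4_300):
    if is_in_control(daily_row_counts, todays_count):
        print(todays_count, "looks consistent with history; continue the load")
    else:
        print(todays_count, "is outside the control limits; stop the load")
```

The same check could be applied per source, per partition, or to a distribution statistic instead of a raw count; which metric to monitor depends on the data.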
7 years ago
Can you trigger an Alteryx job from a SSIS package? ETL
7 years ago | 3 comments |
self.ETL | Score: 2
Reddit
Latest Comment
You can build flows as "macros" which you can then unit test completely independently of your source/destination databases (try that in SSIS..)
I believe you can do this in SSIS using project parameters and environment variables, specifically if you are calling the projects via a SQL Server job. You can use it for source/target connections, as well as values to pass into SQL clauses. More information here.
I'm not an SSIS expert, but I believe you can do many of the things in that list using SSIS. I'm not sure how the browse tool works in Alteryx, but it sounds like the data monitor in SSIS, for instance.
7 years ago
Trying to switch from hardware QA to ETL QA, need guidance (xpost r/cscareerquestions) ETL
7 years ago | 2 comments |
self.ETL | Score: 3
Reddit
Versioning for Database Objects ETL
7 years ago | 5 comments |
self.ETL | Score: 8
Reddit
Latest Comment
So this tool is really being built from a developer's perspective. Initially, the first release will contain simple diagnostics using data that should be available from something like information_schema (if you have access to an object, you should have access to the object's data in information_schema). DBAs have other tools at their disposal that are generally RDBMS-specific, with tons of bells and whistles that devs basically just don't need.
7 years ago
Machine learning in Talend - Decision Trees ETL
7 years ago | 1 comment |
help.talend.com | Score: 2
Reddit
ETL Tool for Oracle cloud to MS SQL ETL
7 years ago | 9 comments |
self.ETL | Score: 2
Reddit
Latest Comment
Have you considered writing a custom tool in Go or Python? You can load a table extremely quickly and easily with Python.
7 years ago
looking for a best practices in etl page independent of platform or language. ETL
7 years ago | 5 comments |
self.ETL | Score: 1
Reddit
Latest Comment
Aside from Kimball's books, there isn't much. You can read some thoughts on the topic in this thread. Bill Inmon did actual ETL developers a huge disservice when he pushed people away from using custom code and towards expensive proprietary ETL suites. As a technology, ETL feels so far behind other areas because of the route it's taken.
7 years ago
Suggestions for good resources to find ETL/BI projects ETL
7 years ago | 1 comment |
self.ETL | Score: 3
Reddit
Latest Comment
Look for companies that are experiencing some change due to strong growth or other factors, which will result in new processes, a need for better insights and the like:
- early-stage start-ups that have a good idea, maybe a few ad hoc reports, but no time or resources to build up their analytics environment
- companies that just had a funding round and can start new projects as well as spend money on new employees and contractors
- companies that just went through a merger and need to consolidate their systems as fast as possible
7 years ago
Using Unicode for foreign languages - Any advice ETL
7 years ago | 2 comments |
self.ETL | Score: 2
Reddit
Latest Comment
Kanji does not have upper and lower case. They're not letters.
If I use a French keyboard and write a name, but another time use the same letters on a Russian keyboard, will a lookup find the connection?
I don't expect it to, as the strings are not the same.
I did some stuff in Korean. The best advice I can offer is to be sure to set everything to NVARCHAR/Unicode.
I'd suggest adding a field with a language key in it. Especially if you're concerned about functions affecting different languages differently.
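To illustrate why such a lookup fails, here is a small sketch using Python's standard `unicodedata` module. The helper `same_key` is a hypothetical name. It shows that Unicode normalization reconciles two encodings of the same accented letter, but does not make a Latin letter equal to its Cyrillic look-alike, because those are different code points.

```python
import unicodedata

latin_a = "a"            # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A (looks the same on screen)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent


def same_key(left, right):
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", left) == unicodedata.normalize("NFC", right)


print(composed == decomposed)         # raw comparison of the two encodings
print(same_key(composed, decomposed)) # comparison after normalization
print(same_key(latin_a, cyrillic_a))  # look-alike letters from different scripts
```

Matching look-alike characters across scripts would need an explicit mapping or transliteration step, which is one reason a language key column can be useful.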
7 years ago
Overview of input fields on Talend ETL
7 years ago | 1 comment |
self.ETL | Score: 2
Reddit
Latest Comment
I'm afraid I don't really understand your question, but have you looked at the Data Quality perspective/studio?
7 years ago
[Discussion] Web application ETL design ETL
7 years ago | 8 comments |
self.ETL | Score: 5
Reddit
Latest Comment
I would use what I call the modular approach. The idea is to split the work across different parts of the database (schemas).
One module would be to store raw data, just a dump.
The second would be an archive module, where previous data is merged with the new data to find the diffs. If the size is not that big (under a few GB on cheap hardware, more than 100 GB on current hardware), I would compute a diff (an MD5 hash can be a huge help here) and not rely on timestamps and/or sequential PKs. Never trust the source system if you don't have to.
Another module would prepare the data for reporting (on top of the second): a classic star schema or some precomputed views (or classic views), with a fresh reload every time. It should be fast, so there is no need to complicate things.
And finally, one last module for QA. You would log every data quality issue here.
With that, you have a fairly good audit trail if needed and atomicity of a database.
You can find more information on the modular approach here.
PostgreSQL can be really fast, but don't do insert/update/delete when you don't have to. Use "create table as" and bulk insert (pgdump in Python doesn't support that directly). It's not as fast as an MPP database (far from it), but still.
Something really ELT then.
My experience leads me to disagree with kenfar, but that's probably just because we are each good in our own way.
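The hash-based diff mentioned for the archive module could look roughly like the sketch below. It is an illustration under stated assumptions, not the commenter's code: rows are plain tuples, the business key sits in the first position, and the names `row_hash` and `diff_snapshots` are hypothetical.

```python
import hashlib


def row_hash(row):
    """MD5 over all column values, joined with a separator unlikely to occur in data."""
    # Note: None and the empty string hash identically in this simple sketch.
    payload = "\x1f".join("" if value is None else str(value) for value in row)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def diff_snapshots(archived, incoming, key_index=0):
    """Return (inserted, updated, deleted) keys by comparing row hashes."""
    old = {row[key_index]: row_hash(row) for row in archived}
    new = {row[key_index]: row_hash(row) for row in incoming}
    inserted = sorted(key for key in new if key not in old)
    deleted = sorted(key for key in old if key not in new)
    updated = sorted(key for key in new if key in old and new[key] != old[key])
    return inserted, updated, deleted


# Illustrative data: the previous archive and a fresh raw dump.
archived_rows = [(1, "alice", 10), (2, "bob", 20), (3, "carol", 30)]
incoming_rows = [(1, "alice", 10), (2, "bob", 25), (4, "dave", 40)]

inserted, updated, deleted = diff_snapshots(archived_rows, incoming_rows)
print("inserted:", inserted, "updated:", updated, "deleted:", deleted)
```

Because the comparison never looks at a source timestamp or sequence, it detects changes even when the source system fails to maintain those columns, which matches the advice not to trust the source.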
7 years ago
capitalone funded alternative to pentaho clover and talend? ETL
7 years ago | 3 comments |
github.com | Score: 5
Reddit
“I choose you!” - criteria for selecting a data warehouse platform ETL
7 years ago | 1 comment |
blog.panoply.io | Score: 5
Reddit
Latest Comment
I'm actually pretty dissatisfied with the Redshift instance options: you get either SSDs, which are great for concurrency, or rotating media, which is great for large volumes. But you can't get both, and you can't use EBS or S3.
This means that if you want to support a lot of concurrency but also have large data volumes, your cost can go through the roof. That is unfortunate, and is way behind where commercial databases have been for the past twenty years. It also means that supposed cost advantages over a locally-hosted database server can evaporate pretty quickly.
Anyhow, I would also like to hear about cost & performance comparisons of Redshift against Snowflake when Snowflake is configured to grow & shrink cluster sizes dynamically, so that they're minimally sized at night and on weekends, then sized up when needed. The last comparison I looked at didn't include this in the Snowflake configuration, which is as bad as comparing BigQuery against Redshift without setting up sort & dist keys on the Redshift tables.
7 years ago
Embarrassing questions, but I'm trying to classify my new job. Am I in the ETL world now? ETL
7 years ago | 2 comments |
self.ETL | Score: 3
Reddit
Latest Comment
These definitions tend to shift over time and overlap:
- Integration solutions are often more involved with connecting transactional systems through service layers.
- ETL solutions are traditionally involved with change data capture, heavier transformations, aggregation and potentially complex pipelines. And in the big-data/data-science space it may involve a vast number of jobs, running complex analysis, and managing the execution dependencies between them.
But they overlap in the middle. And some solution providers, like these, appear to be trying to cover both.
7 years ago
ETL vs ELT: the difference is in the how ETL
7 years ago | comments |
panoply.io | Score: 1
Reddit
F# for Data Engineering • r/dataengineering ETL
7 years ago | comments |
reddit.com | Score: 1
Reddit
Which niches or vendor tools in the ETL world are more resource-constrained, where it's presently harder to fill open positions? (I could perhaps contribute to filling a gap :) ) ETL
7 years ago | 4 comments |
self.ETL | Score: 4
Reddit
Latest Comment
CloverETL - easy to learn, with a growing community. We started using it a while ago and it's pretty nice. I see them being pretty large in the next few years; check 'em out.
7 years ago
Which niche areas or tools within ETL world are more resource-constrained, or having a harder time filling open positions? ETL
7 years ago | comments |
self.ETL | Score: 1
Reddit
11 Great ETL Tools, and the Case for Saying “No” to ETL ETL
7 years ago | comments |
panoply.io | Score: 4
Reddit
DataStage - Comparing Two Datasets ETL
7 years ago | 1 comment |
self.ETL | Score: 1
Reddit
Latest Comment
The only tool I am familiar with that automates a part of this is Cozyroc's Table Diff SSIS component, but it's expensive, requires SSIS, and isn't very customizable. It's pretty quick to set up but is not completely dynamic: you still have to let it know the key(s) and what to look for.
I don't know any way of performing this dynamically, but I know how to do it pretty quickly. If I can get the table definitions I copy them into Excel, transpose them and format them into pairs:
select
    tbl1.cola, tbl2.cola,
    tbl1.colb, tbl2.colb,...
from tbl1
full outer join tbl2 -- outer to capture missing records on either side
    on tbl1.pk = tbl2.pk
Then I can run through the current record in pairs, stepping through the set by two, but grabbing attributes i and i+1. Then I run them through a few possibilities: i != i+1; i is NULL and i+1 is not NULL; i is not NULL and i+1 is NULL; etc. For any caught, I store the records/values in another table. This part can be reused, but the query has to be written each time.
I do the same thing to store deltas and history. I store the key, field name, source value (new), target value (old), the delta (if date or numeric), and the current datetime, so I can trigger events when specific attributes change, or change to a specific value.
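The reusable pairwise step described above might be sketched as follows. This is an illustration only: the function name `compare_pairs`, the dictionary layout, and the sample record are assumptions, and only numeric deltas are computed here (date deltas are left out for brevity).

```python
from datetime import datetime, timezone


def compare_pairs(pk, columns, paired_values):
    """Walk a result row laid out as [src_a, tgt_a, src_b, tgt_b, ...].

    Returns one record per column whose source and target values differ,
    including the cases where exactly one side is None (SQL NULL).
    """
    checked_at = datetime.now(timezone.utc)
    differences = []
    for i in range(0, len(paired_values), 2):
        source, target = paired_values[i], paired_values[i + 1]
        if source == target:  # also true when both sides are None
            continue
        delta = None
        if isinstance(source, (int, float)) and isinstance(target, (int, float)):
            delta = source - target  # numeric delta only in this sketch
        differences.append({
            "key": pk,
            "field": columns[i // 2],
            "source_value": source,
            "target_value": target,
            "delta": delta,
            "checked_at": checked_at,
        })
    return differences


# Illustrative row for pk 42: cola matches, colb changed, colc is NULL on one side.
row_diffs = compare_pairs(42, ["cola", "colb", "colc"], [1, 1, 10.0, 7.5, None, "x"])
for record in row_diffs:
    print(record["field"], record["source_value"], record["target_value"], record["delta"])
```

The returned records map directly onto the delta/history table the comment describes, so the same function can be reused for any paired query.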
7 years ago
[HIRING] Informatica Developer - 3 month Contract-to-hire - Orange County, CA or Atlanta GA ETL
7 years ago | comments |
self.ETL | Score: 4
Reddit
How to do ETL to Elasticsearch in Node ETL
7 years ago | 1 comment |
lessrework.com | Score: 1
Reddit
Latest Comment
Is this Aaron's new blog? We're looking to do a modernization effort on a fairly complicated form application. It sounds good that Node and the Elastic Stack work well together.
7 years ago
IPC for ETL components ETL
7 years ago | 2 comments |
self.ETL | Score: 2
Reddit
How good is the job market right now for Microsoft-specialized BI/ETL Developer? (SSIS, SSRS) ETL
7 years ago | 15 comments |
self.ETL | Score: 2
Reddit
Latest Comment
I'm looking around and I'd say the job market is decent depending on where you are and how mobile you are (though a lot of places are looking for 'bigdata/nosql' experience as well). Though I would think the job market for .NET devs is good (and for statisticians!), so I'm assuming this is out of interest instead of career optimization.
As someone also middle-aged: age is really only a potential issue in getting into tech firms, especially more cutting-edge ones (who want to work you 60 hours a week). Frankly, those sorts of places are generally not using the MS stack. The sort of places that are hiring SSIS/SSRS devs are usually more corporate tech departments (i.e. of retail orgs or logistics orgs, etc.), and age is not an issue for those.
7 years ago
1000+ CSV formatted files to be loaded. ETL
7 years ago | 20 comments |
self.ETL | Score: 3
Reddit
performance issues with ETL in vertica ETL
7 years ago | 6 comments |
self.ETL | Score: 2
Reddit
Latest Comment
I have used Vertica and it comes with its own data loader. If you use that, it's bloody fast. We tried to use Sqoop to load data into Vertica and that was very slow. But with its native data loader, the data load was blazing fast.
7 years ago
Talend Online Training ETL
7 years ago | 1 comment |
mindmajix.com | Score: 4
Reddit
Pentaho BI Online Training | Online Pentaho BI Certification Course in USA, UK, Canada, Australia, Dubai, India ETL
7 years ago | 2 comments |
a1trainings.com | Score:
Reddit
SSIS vs T SQL ETL
7 years ago | 14 comments |
self.ETL | Score: 3
Reddit
Latest Comment
Many people posted good comments. I would add that it depends on how much "analytic" work you need to do with the data. SSIS is great at scheduling and very simple transformations but is limited in a more analytical (data science) approach. At my company, we use SSIS for big overnight jobs and Alteryx for data science related ETL.
7 years ago
Workflow management vs ETL 'Suite' ETL
7 years ago | 8 comments |
self.ETL | Score: 3
Reddit
Latest Comment
Hey, thanks for the replies! I guess this is where things get tricky. Up to now SSIS served up data 'as-is', meaning we provided the relevant relational data tables that an application required and left it to the business layer to deal with querying it any way it may wish. But now, since we're working with Mongo, we have a much stronger need to implement business logic within the ETL itself, which should reside within the dev team and not be the DevOps or DBA's responsibility.
Currently we're using both a Node.js pipelines app and a .NET console app to perform the data load and extract from Mongo. I guess, having looked at Airflow a bit more, it would've been a better solution to go with individual linked tasks instead of two separate full-blown applications... I wonder if my teammates know Python... :P
7 years ago
Advanced ETL Processor Enterprise ETL
7 years ago | 3 comments |
etl-tools.com | Score:
Reddit
What is your data lineage strategy and tool(s) of choice? ETL
7 years ago | 7 comments |
self.ETL | Score: 2
Reddit
Latest Comment
Also one thing you should definitely consider: Is maintaining the lineage manually sustainable in the long run? The answer, considering any DWH except a toy one, tends to be an overwhelming NO. Note that I work for a vendor that automates data lineage from custom code so I may be biased, just so slightly :) But do give it a thought, you will thank yourself later. I am assuming here that when you say "home grown" you mean manual, which is the only sane explanation anyway ;-)
7 years ago
Talend Training and Certification Course Online in USA|UK|Canada|Australia|Dubai|India ETL
7 years ago | 4 comments |
a1trainings.com | Score:
Reddit
What cloud ETL tools are you using? ETL
7 years ago | 15 comments |
self.ETL | Score: 5
Reddit
Latest Comment
It looks like you've received some good recommendations here and I just want to suggest that you take a look at the peer reviews over at IT Central Station to see what people who have used these tools have to say about them.
For example, in a review of WhereScape RED (which is ranked by ITCS users as one of the top Data Integration and Access Tools) one Data Warehouse Manager for whom easy documentation was really important commented that "We have always struggled with maintaining documentation on our ETL. It is very nice to get source-to-target mappings, table diagrams, and dependency diagrams, which are up to date, with the click of a button." (you can see the rest of this review here if you want.)
So it's just really helpful to look at real-user reviews to get a good idea of what it's like to use these tools.
Here's a list of popular cloud data tools with reviews, which includes Matillion, Informatica Cloud and some of the other tools suggested here.
Hope this helps.
7 years ago
Idea for an ETL task ETL
7 years ago | 9 comments |
self.ETL | Score: 2
Reddit
Latest Comment
Performance depends on a number of factors, and there are physical limitations you will not be able to overcome.
Imagine that you have 100m records in a DB2 source table.
So you run:
insert into sqlserver table select * from db2table
Performance depends on the DB2 hard disk, the network bandwidth and the SQL Server hard disk.
It does not matter which tool you use; you will not be able to make it any faster.
The way to make it faster is to transfer only the data that was modified or added since the last transfer.
That is only possible if the source tables have a last modification date field.
Mike
http://www.etl-tools.com/index.php
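A sketch of the incremental transfer idea from this comment, using in-memory SQLite databases as stand-ins for the DB2 source and SQL Server target. The table names, column names, and the `incremental_copy` helper are hypothetical, and timestamps are assumed to be ISO-8601 strings so that text comparison matches chronological order.

```python
import sqlite3


def incremental_copy(source, target, last_transfer):
    """Copy only rows modified after last_transfer; return the new watermark."""
    rows = source.execute(
        "SELECT id, payload, modified_at FROM src_table WHERE modified_at > ?",
        (last_transfer,),
    ).fetchall()
    # Upsert so that changed rows overwrite their older copies in the target.
    target.executemany(
        "INSERT OR REPLACE INTO tgt_table (id, payload, modified_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    return max((row[2] for row in rows), default=last_transfer)


# Stand-in databases with illustrative data.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src_table (id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)")
target.execute("CREATE TABLE tgt_table (id INTEGER PRIMARY KEY, payload TEXT, modified_at TEXT)")
source.executemany(
    "INSERT INTO src_table VALUES (?, ?, ?)",
    [
        (1, "old row", "2024-01-01T08:00:00"),
        (2, "changed row", "2024-01-02T12:00:00"),
        (3, "new row", "2024-01-03T09:30:00"),
    ],
)
source.commit()

watermark = incremental_copy(source, target, "2024-01-02T00:00:00")
copied = target.execute("SELECT COUNT(*) FROM tgt_table").fetchone()[0]
print("rows copied:", copied, "next watermark:", watermark)
```

One limitation of this approach is that rows deleted in the source are never seen by the query, so deletes need separate handling.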
7 years ago
CloverETL? ETL
7 years ago | 2 comments |
self.ETL | Score: 4
Reddit
Latest Comment
Used it for a while, really did not like it. That said, we're on an in-house Python system using Airflow now because we prefer code + git over GUI + central ETL server.
7 years ago
Crab SQL for the filesystem, now for macOS & Windows (x-post from r/SQL) ETL
7 years ago | 3 comments |
etia.co.uk | Score: 6
Reddit